Collaborative Data Management: How Crowdsourcing Can Help To Manage Data

Data management efforts such as Master Data Management (MDM) are a popular approach to achieving high-quality enterprise data. However, MDM can be heavily centralized and labour intensive, and its cost and effort can become prohibitively high. Concentrating data management and stewardship on a few highly skilled individuals, such as developers and data experts, can create a significant bottleneck. This talk explores how to effectively involve a wider community of users in collaborative data management activities. The bottom-up approach of involving crowds in the creation and management of data has been demonstrated by projects like Freebase, Wikipedia, and DBpedia. The talk discusses how collaborative data management can be applied within an enterprise context using platforms such as Amazon Mechanical Turk, MobileWorks, and internal enterprise human computation platforms.

Topics covered include:
- Introduction to Crowdsourcing and Human Computation for Data Management
- Crowds vs. Communities: when to use them and why
- Push vs. Pull methods of crowdsourcing data management
- Setting up and running a collaborative data management process
- Modelling the expertise of communities

Published in: Technology, Education

  1. 1. Collaborative Data Management: How Crowdsourcing Can Help To Manage Data. Edward Curry, Enterprise Data World 2013
  2. 2. Overview (Digital Enterprise Research Institute, www.deri.ie: Enabling networked knowledge)
     - Problems with Data
       - Master Data Management
     - Crowdsourcing
     - Collaborative Data Management
     - Setting up a CDM Process
     - Future Directions
  3. 3. The Problems with Data
     - Knowledge workers need:
       - Access to the right data
       - Confidence in that data
     - Flawed data affects 25% of critical data in the world's top companies
     - Data quality's role in the recent financial crisis:
       - “Asset are defined differently in different programs”
       - “Numbers did not always add up”
       - “Departments do not trust each other’s figures”
       - “Figures … not worth the pixels they were made of”
  4. 4. Master Data Management
     - Master Data Management is a process that can improve data quality
     - What is Data Quality?
       - Desirable characteristics for an information resource
       - Described as a series of quality dimensions: Discoverability, Accessibility, Timeliness, Completeness, Interpretation, Accuracy, Consistency, Provenance & Reputation
  5. 5. Data Quality: Master Data Management [diagram]
     - Steps: Profile Sources, Define Mappings, Cleanse, Enrich, De-duplicate, Define Rules
     - Roles: Data Developer, Data Steward, Data Governance, Business Users
     - Other labels: Master Data, Applications, Product Data
  6. 6. Data Quality
     - Source A (table):
         ID    PNAME      PCOLOR  PRICE
         APNR  iPod Nano  Red     150
         APNS  iPod Nano  Silver  160
     - Source B (XML):
         <Product name="iPod Nano">
           <Items>
             <Item code="IPN890">
               <price>150</price>
               <generation>5</generation>
             </Item>
           </Items>
         </Product>
     - Combined records:
         APNR       iPod Nano  Red     150
         APNR       iPod Nano  Silver  160
         iPod Nano  IPN890     150     5
     - Questions raised: Schema difference? Value conflicts? Entity duplication?
     - People involved: Data Developer, Data Steward, Business Users (?)
     - Expertise labels: Technical; Domain (Technical); Domain
  7. 7. Master Data Management
     - Pros
       - Can create a single version of truth
       - Standardized information creation and management
       - Improves data quality
     - Cons
       - Significant upfront costs and efforts
       - Participation limited to a few (mostly) technical experts
       - Difficult to scale for large data sources
         - Extended enterprise, e.g. partners, data vendors
       - Small % of data under management (i.e. CRM, Product, …)
  8. 8. Enterprise Data Landscape [diagram]
     - The Managed (MDM): reference data managed through well-defined policies and a governance council
     - The Known (Enterprise Data): data directly managed by the enterprise and its departments
     - The Reality (Relevant External Data): all data relevant to the enterprise and its operations
  9. 9. CROWDSOURCING
  10. 10. Crowdsourcing Industry Landscape
  11. 11. Introduction to Crowdsourcing
      - Coordinating a crowd (a large group of workers) to do micro-work (small tasks) that solves problems (that computers or a single user can't)
      - A collection of mechanisms and associated methodologies for scaling and directing crowd activities to achieve goals
      - Related areas
        - Collective Intelligence
        - Social Computing
        - Human Computation
        - Data Mining
  12. 12. When Computers Were Human
      - Maskelyne, 1760
        - Used human computers to create an almanac of moon positions
          - Used for shipping/navigation
        - Quality assurance
          - Do calculations twice
          - Compare to third verifier
  13. 13. When Computers Were Human
  14. 14. Human vs Machine Affordances
      - Human
        - Visual perception
        - Visuospatial thinking
        - Audiolinguistic ability
        - Sociocultural awareness
        - Creativity
        - Domain knowledge
      - Machine
        - Large-scale data manipulation
        - Collecting and storing large amounts of data
        - Efficient data movement
        - Bias-free analysis
  15. 15. When to Crowdsource?
      - Computers cannot do the task
      - A single person cannot do the task
      - Work can be split into smaller tasks
  16. 16. Tag a Tune
  17. 17. Peekaboom
  18. 18. Foldit
  19. 19. ReCaptcha
      - OCR
        - ~1% error rate
        - 20%-30% for 18th and 19th century books
      - “40 million ReCAPTCHAs every day” (2008)
        - Fixing 40,000 books a day
  20. 20. Generic Architecture [diagram]
      - Requestors
      - Platform/Marketplace (Publish Task, Task Management)
      - Workers
      - Interactions numbered 1 to 4
  21. 21. Amazon Mechanical Turk
  22. 22. CrowdFlower
  23. 23. COLLABORATIVE DATA MANAGEMENT
  24. 24. Collaborative knowledge base maintained by a community of web users; users create entity types and their meta-data according to guidelines; requires administrative approvals for schema changes by end users
  25. 25. Wikipedia
      - Collaboratively built by a large community
        - More than 19,000,000 articles, 270+ languages, 3,200,000+ articles in English
        - More than 157,000 active contributors
      - Accuracy and stylistic formality are equivalent to expert-based resources
        - i.e. Columbia and Britannica encyclopedias
      - MediaWiki
        - Software behind Wikipedia
        - Widely used inside organizations
        - Intellipedia: 16 U.S. intelligence agencies
        - Wiki Proteins: curated protein data for knowledge discovery
  26. 26. DBpedia Knowledge Base
      - DBpedia provides direct access to data
        - Indirectly uses the wiki as a data curation platform
        - Inherits a massive volume of curated Wikipedia data
        - 3.4 million entities and 1 billion RDF triples
        - Comprehensive data infrastructure
          - Concept URIs
          - Definitions
          - Basic types
  27. 27. A Bottom up Approach to MDM
      - Engage more human workers to collaboratively manage enterprise data
      - [Diagram] Axes: Number of Participants (10s-100s to 10,000s-100,000s) and Data Control (Top-down to Bottom-up), plotting MDM and Collaborative Enterprise Data Management
  28. 28. Emerging Enterprise Data Landscape [diagram]
      - The Managed: reference data managed through well-defined policies and a governance council
      - The Known: data directly managed by the enterprise and its departments
      - The Reality: all data relevant to the enterprise and its operations
      - Region labels: MDM, Collaboratively Managed, Enterprise Data, Relevant External Data
  29. 29. Algorithm + Crowd [diagram]
      - Data Sources
      - Data Quality Algorithms
      - Human Computation
      - Clean Data
      - Participants: Developers, Data Governance, Internal Community, External Crowd
  30. 30. Examples of CDM Tasks
      - Understanding customer sentiment for launch of new product around the world.
      - Implemented 24/7 sentiment analysis system with workers from around the world.
      - Categorize millions of products on eBay's catalog with accurate and complete attributes
      - Combine the crowd with machine learning to create an affordable and flexible catalog quality system
  31. 31. Examples of CDM Tasks
      - Natural Language Processing
        - Dialect Identification, Spelling Correction, Machine Translation, Word Similarity
      - Computer Vision
        - Image Similarity, Image Annotation/Analysis
      - Classification
        - Data attributes, Improving taxonomy, search results
      - Verification
        - Entity consolidation, de-duplicate, cross-check, validate data
      - Enrichment
        - Judgments, annotation
  32. 32. SETTING UP A CDM PROCESS
  33. 33. Core Design Questions of CDM
      - What: Goal
      - Why: Incentives
      - Who: Workers
      - How: Process
      - Source: Malone, T. W., Laubacher, R., & Dellarocas, C. N. Harnessing crowds: Mapping the genome of collective intelligence. MIT Sloan Research Paper 4732-09, (2009).
  34. 34. Who is doing it? (Workers)
      - Hierarchy (Assignment)
        - Someone in authority assigns a particular person or group of people to perform the task
        - Within the enterprise
      - Crowd (Choice)
        - Anyone in a large group who chooses to do so
        - Internal or external crowds
  35. 35. Why are they doing it? (Incentives)
      - Motivation
        - Money ($$££)
        - Glory (reputation/prestige)
        - Love (altruism, socializing, enjoyment)
        - Unintended by-product (e.g. ReCaptcha, captured in workflow)
        - Self-serving resources (e.g. Wikipedia, product/customer data)
      - Determine pay and time for each task
        - Marketplace: delicate balance
          - Money does not improve quality but can increase participation
        - Internal hierarchy: engineering opportunities for recognition
          - Performance review, prizes for top contributors, badges, leaderboards, etc.
  36. 36. Effect of Payment on Quality
      - Cost does not affect quality [Mason and Watts, 2009, AdSafe]
      - Similar results for bigger tasks [Ariely et al, 2009]
      - Source: [Panos Ipeirotis. WWW2011 tutorial]
  37. 37. What is being done? (Goal)
      - Creation Tasks
        - Create/Generate
        - Find
        - Improve/Edit/Fix
      - Decision (Vote) Tasks
        - Accept/Reject
        - Thumbs up/Thumbs down
        - Vote for best
  38. 38. How is it being done? (How)
      - Tasks integrated in normal workflow of those creating and managing data
        - Simple as vetting or “rating” results of algorithm
      - Task Design
        - Task Interface
        - Task Assignment/Routing
        - Task Quality Assurance
  39. 39. Task Design [diagram]
      - Input
      - Task Router (before computation)
      - Task Interface (during computation)
      - Output Aggregation (after computation)
      - Output
      - Source: Edith Law and Luis von Ahn, Human Computation - Core Research Questions and State of the Art
  40. 40. Pull Routing
      - Workers seek tasks and assign them to themselves
        - Search and discovery of tasks supported by the platform
        - Task recommendation
        - Peer routing
      - [Diagram] Workers select tasks through a search & browse interface; an algorithm produces the result
  41. 41. Push Routing
      - System assigns tasks to workers based on:
        - Past performance
        - Expertise
        - Cost
        - Latency
      - [Diagram] An assignment algorithm assigns tasks to workers, who return results through the task interface
      - Source: www.mobileworks.com
  42. 42. Managing Task Quality Assurance
      - Redundancy: Quorum Votes
        - Replicate the task (e.g. 3 times)
        - Use majority voting to determine the right value (% agreement)
        - Weighted majority vote
      - Gold Data / Honey Pots
        - Inject trap questions to test quality
        - Worker fatigue check (habit of saying no all the time)
      - Estimation of Worker Quality
        - Redundancy plus gold data
      - Qualification Test
        - Use test tasks to determine a user's ability for such tasks
  43. 43. Future Directions
      - Task Management
        - Task assignment, payment, routing
          - Optimizing for cost, quality, completion time
      - Human–Computer Interaction
        - Payment / incentives
        - User interface and interaction design
        - Worker reputation, recruitment, retention
      - Quality Control
        - Trust, reliability, spam detection, consensus
  44. 44. Summary
      - Collaborative Data Management
        - Emerging trend for data management in the enterprise
        - Crowdsourcing + micro tasks
        - A number of emerging platforms to assist
      - [Diagram] Dirty Data, Data Quality Algorithms, Human Computation, Clean Data
  45. 45. THE BIG PROJECT (BIG: Big Data Public Private Forum)
      - Overall objective: Bringing the necessary stakeholders into a self-sustainable industry-led initiative, which will greatly contribute to enhance the EU competitiveness taking full advantage of Big Data technologies.
      - Work at technical, business and policy levels, shaping the future through the positioning of IIM and Big Data specifically in Horizon 2020.
  46. 46. Key facts about BIG-project
      - Type of project: CSA
      - Project start date: September 2012
      - Duration: 26 months
      - Call: FP7-ICT-2011-8
      - Effort: 552,5 PM
      - Budget: 3,038 M€
      - Max EC contribution: 2,499 M€
      - Consortium: 11 partners
  47. 47. BIG: PROJECT STRUCTURE [diagram]
      - Value chain (technical areas, supply):
        - Data acquisition: structured data, unstructured data, event processing, sensor networks, streams
        - Data analysis: data preprocessing, semantic analysis, sentiment analysis, other features analysis, data correlation
        - Data curation: trust, provenance, data augmentation, data validation
        - Data storage: RDBMS limitations, NOSQL, cloud storage
        - Data usage: decision support, decision making, automatic steps, domain-specific usage
      - Industry-driven working groups (needs):
        - Health
        - Public Sector
        - Telco, Media & Entertainment
        - Finance & Insurance
        - Manufacturing, Retail, Energy, Transport
  48. 48. About the Presenter
      - Edward is a research scientist at the Digital Enterprise Research Institute. His areas of research include green IT/IS, energy informatics, linked data, integrated reporting, and cloud computing. He has worked extensively with industry and government advising on the adoption patterns, practicalities and benefits of new technologies. He has published in leading journals and books, and has spoken at international conferences including the MIT CIO Symposium.
      - URL: www.edwardcurry.org
      - Email: edcurry@acm.org
      - Twitter: @EdwardACurry
      - Slides: slideshare.net/edwardcurry
  49. 49. Selected References: Big Data & Data Quality
      - S. Lavalle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, “Big Data, Analytics and the Path from Insights to Value,” MIT Sloan Management Review, vol. 52, no. 2, pp. 21–32, 2011.
      - A. Haug and J. S. Arlbjørn, “Barriers to master data quality,” Journal of Enterprise Information Management, vol. 24, no. 3, pp. 288–303, 2011.
      - R. Silvola, O. Jaaskelainen, H. Kropsu-Vehkapera, and H. Haapasalo, “Managing one master data – challenges and preconditions,” Industrial Management & Data Systems, vol. 111, no. 1, pp. 146–162, 2011.
      - E. Curry, S. Hasan, and S. O’Riain, “Enterprise Energy Management using a Linked Dataspace for Energy Intelligence,” in Second IFIP Conference on Sustainable Internet and ICT for Sustainability, 2012.
      - D. Loshin, Master Data Management. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2008.
      - B. Otto and A. Reichert, “Organizing Master Data Management: Findings from an Expert Survey,” in Proceedings of the 2010 ACM Symposium on Applied Computing - SAC ’10, 2010, pp. 106–110.
  50. 50. Selected References: Collective Intelligence, Crowdsourcing & Human Computation
      - A. Doan, R. Ramakrishnan, and A. Y. Halevy, “Crowdsourcing systems on the World-Wide Web,” Communications of the ACM, vol. 54, no. 4, p. 86, Apr. 2011.
      - E. Law and L. von Ahn, “Human Computation,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 5, no. 3, pp. 1–121, Jun. 2011.
      - M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin, “CrowdDB: Answering Queries with Crowdsourcing,” in Proceedings of the 2011 international conference on Management of data - SIGMOD ’11, 2011, p. 61.
      - P. Wichmann, A. Borek, R. Kern, P. Woodall, A. K. Parlikad, and G. Satzger, “Exploring the ‘Crowd’ as Enabler of Better Information Quality,” in Proceedings of the 16th International Conference on Information Quality, 2011, pp. 302–312.
      - W. A. Mason and D. J. Watts, “Financial incentives and the ‘performance of crowds’,” SIGKDD Explorations, vol. 11, no. 2, pp. 100–108, 2009.
      - P. Ipeirotis, “Managing Crowdsourced Human Computation,” WWW2011 Tutorial.
      - O. Alonso and M. Lease, “Crowdsourcing 101: Putting the WSDM of Crowds to Work for You,” WSDM, Hong Kong, 2011.
      - When Computers Were Human: http://www.youtube.com/watch?v=YwqltwvPnkw
  51. 51. Selected References: Collaborative Data Management
      - E. Curry, A. Freitas, and S. O’Riain, “The Role of Community-Driven Data Curation for Enterprises,” in Linking Enterprise Data, D. Wood, Ed. Boston, MA: Springer US, 2010, pp. 25–47.
      - U. ul Hassan, S. O’Riain, and E. Curry, “Towards Expertise Modelling for Routing Data Cleaning Tasks within a Community of Knowledge Workers,” in 17th International Conference on Information Quality (ICIQ 2012), Paris, France, 2012.
      - U. ul Hassan, S. O’Riain, and E. Curry, “Effects of Expertise Assessment on the Quality of Task Routing in Human Computation,” in 2nd International Workshop on Social Media for Crowdsourcing and Human Computation, Paris, France, 2013.
      - U. ul Hassan, S. O’Riain, and E. Curry, “Leveraging Matching Dependencies for Guided User Feedback in Linked Data Applications,” in 9th International Workshop on Information Integration on the Web (IIWeb2012), Scottsdale, Arizona: ACM, 2012.
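
The quality assurance techniques listed on slide 42 (redundant quorum votes, gold data, and a weighted majority vote) can be illustrated with a small Python sketch. This is not code from the talk: the function names, the data layout, and the neutral default weight of 0.5 for workers without gold answers are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote(answers):
    """Return (winning_label, agreement) for redundant answers to one task.

    `agreement` is the fraction of answers that match the winning label.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def worker_accuracy(gold_truth, gold_answers):
    """Estimate per-worker accuracy from gold (trap) questions.

    gold_truth:   {task_id: correct_label}
    gold_answers: {worker: {task_id: label_given}}
    Workers with no gold answers get 0.5 (an assumed neutral default).
    """
    accuracy = {}
    for worker, answered in gold_answers.items():
        scored = [answered[t] == gold_truth[t] for t in answered if t in gold_truth]
        accuracy[worker] = sum(scored) / len(scored) if scored else 0.5
    return accuracy


def weighted_vote(answers_by_worker, accuracy):
    """Weighted majority vote: each answer counts with its worker's accuracy."""
    totals = defaultdict(float)
    for worker, label in answers_by_worker.items():
        totals[label] += accuracy.get(worker, 0.5)
    return max(totals, key=totals.get)
```

With three workers answering the same de-duplication question, the plain majority and the weighted vote can disagree when the minority answer comes from the only worker who passed the gold questions; which of the two to trust is a design decision for the process owner.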

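Push routing, as described on slide 41, has the system pick a worker for each task from past performance, expertise, cost, and latency. The sketch below shows one possible way to combine those four factors into a single score. The `Worker` fields, the weights, and the cost and latency caps are hypothetical values chosen for illustration, not something specified in the talk or by any platform.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    accuracy: float                              # past performance, 0..1
    expertise: set = field(default_factory=set)  # topics the worker knows
    cost: float = 0.0                            # price per task
    latency: float = 0.0                         # expected minutes to respond


def score(worker, topic, max_cost, max_latency, weights=(0.4, 0.3, 0.15, 0.15)):
    """Combine the four routing factors into one score (higher is better).

    The weights are illustrative assumptions and would need tuning.
    """
    w_perf, w_exp, w_cost, w_lat = weights
    expertise = 1.0 if topic in worker.expertise else 0.0
    cost_score = 1.0 - min(worker.cost / max_cost, 1.0)
    latency_score = 1.0 - min(worker.latency / max_latency, 1.0)
    return (w_perf * worker.accuracy + w_exp * expertise
            + w_cost * cost_score + w_lat * latency_score)


def push_route(topic, workers, max_cost=1.0, max_latency=60.0):
    """Assign the task to the highest-scoring worker."""
    return max(workers, key=lambda w: score(w, topic, max_cost, max_latency))
```

A pull-routing system would skip `push_route` entirely and expose the open tasks through search and browsing instead, leaving the choice to the workers.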