
CrowdTruth Tutorial: Using the Crowd to Understand Ambiguity

CrowdTruth Tutorial at HCOMP-NL: Using the Crowd to Understand Ambiguity


  1. CrowdTruth: Using the Crowd to Understand Ambiguity. Anca Dumitrache, Lora Aroyo, Chris Welty http://crowdtruth.org
  2. http://crowdtruth.org
  3. 1759: computing the date of Halley's Comet's return. 1892: the term first appears in the NYT http://crowdtruth.org
  4. "wisdom of crowds". 1785: Marquis de Condorcet, statistics for democracy • collective decisions of large groups of people • a group of error-prone decision-makers can be surprisingly good at picking the best choice • for a binary decision, the odds that the majority of a crowd picks the right answer are greater than the odds that any one member picks it alone • performance gets better as the crowd grows http://crowdtruth.org
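Condorcet's claim is easy to check empirically. Below is a minimal Monte Carlo sketch of the jury theorem; the per-voter accuracy p = 0.6 and the crowd sizes are illustrative assumptions, not figures from the talk:

```python
import numpy as np

rng = np.random.default_rng(42)
p = 0.6          # assumed per-voter accuracy on a binary question
trials = 10_000  # Monte Carlo trials per crowd size

for n in [1, 11, 101, 1001]:
    # votes[t, v] is True when voter v in trial t picks the right answer
    votes = rng.random((trials, n)) < p
    majority_right = votes.sum(axis=1) > n / 2
    print(f"crowd of {n:>4}: majority correct in {majority_right.mean():.1%} of trials")
```

With these assumptions the majority is right about 60% of the time for one voter, but nearly always for a crowd of 1001, which is exactly the "performance gets better as the size grows" point.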
  5. "wisdom of crowds". Sir Francis Galton • asked 787 people to guess the weight of an ox • none got the right answer • their collective guess was almost perfect http://crowdtruth.org
  6. WWII Math Rosies. 1942: ballistics calculations and flight trajectories http://crowdtruth.org
  7. "wisdom of crowds". "Who Wants to Be a Millionaire?" • help from an individual expert or an audience poll • the audience majority is right 91% of the time • individuals are right only 65% of the time http://crowdtruth.org
  8. Human Computing: humans solving computational problems too hard for computers alone http://crowdtruth.org
  9. Human Computing. Luis von Ahn: "we treat human brains as processors in a distributed system, each performing a small part of a massive computation" http://crowdtruth.org
  10. • 1979: amateur naturalists Don McCrimmon & Cal Smith first applied the concept of "citizen science" • 1980: Rick Bonney coined the term "citizen science" http://crowdtruth.org
  11. • 2005: term "human computation" by Luis von Ahn • 2005: term "crowdsourcing" by Jeff Howe & Mark Robinson (Wired) • 2006: "Games with a Purpose" by Luis von Ahn • 2006: "The Rise of Crowdsourcing" by Jeff Howe (Wired) • 2007: reCAPTCHA (company) • 2011: Duolingo, Luis von Ahn http://crowdtruth.org
  12. 2009 http://crowdtruth.org
  13. 2013 http://crowdtruth.org
  14. Crowds Contributing in Cultural Heritage
  15. Accurator: ask the right crowd, enrich your collection
  16. http://waisda.nl http://www.prestoprime.org/ @waisda
  17. http://crowdtruth.org
  18. today's complex problem solving requires multiple perspectives http://crowdtruth.org
  19. Crowdsourcing Myths http://crowdtruth.org
  20. Crowdsourcing Myth: Ask the Expert. What if the CROWD IS BETTER? • Human annotators with domain knowledge provide better annotated data, e.g. if you want medical texts annotated for medical relations, you need medical experts • But experts are expensive & don't scale • Multiple perspectives on data can be useful, beyond what experts believe is salient or correct http://CrowdTruth.org
  21. Experts Better? What is the relation between the highlighted terms? "He was the first physician to identify the relationship between HEMOPHILIA and HEMOPHILIC ARTHROPATHY." experts: cause; crowd: no relation. The crowd reads the text literally and provides better examples to the machine http://CrowdTruth.org
  22. Crowdsourcing Myth: Disagreement is Bad. What if it is GOOD? • traditionally, disagreement is considered a measure of poor quality in the annotation task, because the task is poorly defined or the annotators lack training • this makes the elimination of disagreement a goal, rather than accepting disagreement as a natural property of semantic interpretation http://CrowdTruth.org
  23. Disagreement Bad? Does each sentence express the TREAT relation? "ANTIBIOTICS are the first line treatment for indications of TYPHUS." → agreement 95% "Patients with TYPHUS who were given ANTIBIOTICS exhibited side-effects." → agreement 80% "With ANTIBIOTICS in short supply, DDT was used during WWII to control the insect vectors of TYPHUS." → agreement 50% Disagreement can reflect the degree of clarity in a sentence http://CrowdTruth.org
  24. Do gold standards capture all this? http://crowdtruth.org
  25. • single human experts → never perfectly correct • gold standard quality is measured by inter-annotator agreement → what if disagreeing annotators are both right? • gold standards → do not account for alternative interpretations & clarity http://crowdtruth.org
  26. The fallacy of the "one truth" assumption that pervades computational semantics http://crowdtruth.org
  27. "the best collective decisions are the result of disagreement, not consensus or compromise" James Surowiecki http://crowdtruth.org
  28. Position: disagreement is a signal, not noise. Can we harness it?
  29. CrowdTruth • Annotator disagreement is signal, not noise • It is indicative of the variation in human semantic interpretation of signs • It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality http://CrowdTruth.org
  30. RelEx Task in CrowdFlower. "Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1." Is ACUTE FEVER – related to → INFLUENZA AH1N1? http://crowdtruth.org
  31. Worker Vector [figure: one worker's binary vector over the candidate relations] http://crowdtruth.org
  32. Sentence Vector [figure: the worker vectors for one sentence stacked and summed element-wise into the sentence vector] http://crowdtruth.org
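As a minimal sketch of these two structures; the relation set and the annotations below are made up for illustration, not taken from the actual task:

```python
import numpy as np

# assumed closed task: a fixed set of candidate relations per sentence
RELATIONS = ["causes", "treats", "part_of", "symptom", "none"]

def worker_vector(choices):
    """Binary vector with a 1 for every relation the worker selected."""
    v = np.zeros(len(RELATIONS), dtype=int)
    for c in choices:
        v[RELATIONS.index(c)] = 1
    return v

# illustrative annotations for one sentence, one entry per worker
annotations = [["causes"], ["causes", "symptom"], ["none"]]
worker_vecs = np.array([worker_vector(a) for a in annotations])

# the sentence vector is the element-wise sum of all worker vectors
sentence_vec = worker_vecs.sum(axis=0)
print(sentence_vec)  # -> [2 0 0 1 1]
```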
  33. What's in a sentence vector? An unclear relationship between the two arguments is reflected in the disagreement http://crowdtruth.org
  34. What's in a sentence vector? A clearly expressed relation between the two arguments is reflected in the agreement http://crowdtruth.org
  35. Sentence-Relation Score: measures how clearly a relation is expressed in a sentence. "Patients with ACUTE FEVER and nausea could be suffering from INFLUENZA AH1N1." [figure: sentence vector divided by # workers, e.g. causes 10/15, treats 0/15, part of 3/15, symptom 8/15, ...] http://crowdtruth.org
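In its unweighted form this score is just the fraction of workers who selected the relation (the full CrowdTruth metric also weights each worker by their quality). A minimal sketch reusing the fractions from the slide above:

```python
import numpy as np

RELATIONS = ["causes", "treats", "part_of", "symptom"]  # assumed label set
sentence_vec = np.array([10, 0, 3, 8])  # aggregated votes of 15 workers
n_workers = 15

def sentence_relation_score(sentence_vec, relation, n_workers):
    """Unweighted score: the fraction of workers who selected `relation`."""
    return sentence_vec[RELATIONS.index(relation)] / n_workers

for r in RELATIONS:
    print(f"{r}: {sentence_relation_score(sentence_vec, r, n_workers):.2f}")
# causes: 0.67, treats: 0.00, part_of: 0.20, symptom: 0.53
```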
  36. How to model CrowdTruth? Disagreement is a property of the entire system: Sentence Quality, Relation Quality, Worker Quality http://crowdtruth.org
  37. Sentence Quality: measures the clarity of one sentence. AVG(cosine(worker i vector, worker j vector)) ∀ worker pairs [figure: worker vectors and the sentence vector for one sentence] http://crowdtruth.org
  38. Sentence Quality (weighted): the same average cosine, weighted by worker quality & relation quality http://crowdtruth.org
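A minimal sketch of this average pairwise cosine, with optional worker-quality weights; the relation-quality weighting of the vector dimensions is omitted here for brevity:

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0.0 or nv == 0.0 else float(u @ v) / (nu * nv)

def sentence_quality(worker_vecs, worker_quality=None):
    """Average pairwise cosine between the worker vectors on one sentence;
    each pair is weighted by the two workers' quality scores (uniform
    weights give the unweighted metric)."""
    n = len(worker_vecs)
    wq = np.ones(n) if worker_quality is None else np.asarray(worker_quality)
    num = den = 0.0
    for i, j in combinations(range(n), 2):
        w = wq[i] * wq[j]
        num += w * cosine(worker_vecs[i], worker_vecs[j])
        den += w
    return num / den if den else 0.0

vecs = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 0, 0]])  # three workers
print(sentence_quality(vecs))  # high agreement -> close to 1
```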
  39. Worker Quality: measures the performance of one worker. Worker Quality(worker i) = Worker-Worker Agreement(worker i) · Worker-Sentence Agreement(worker i) http://crowdtruth.org
  40. Worker-Worker Agreement: measures the agreement between one worker & all other workers. AVG(cosine(worker i vector, worker j vector)) ∀ workers j, j ≠ i http://crowdtruth.org
  41. Worker-Worker Agreement (weighted): the same average, weighted by sentence quality, relation quality & the quality of the other workers http://crowdtruth.org
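Sketched below for a single shared sentence; the full metric averages over all sentences a pair of workers has in common and applies the weights named on slide 41:

```python
import numpy as np

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0.0 or nv == 0.0 else float(u @ v) / (nu * nv)

def worker_worker_agreement(i, worker_vecs, worker_quality=None):
    """Average cosine between worker i's vector and every other worker's
    vector on one shared sentence, weighted by the other workers' quality."""
    n = len(worker_vecs)
    wq = np.ones(n) if worker_quality is None else np.asarray(worker_quality)
    num = den = 0.0
    for j in range(n):
        if j == i:
            continue
        num += wq[j] * cosine(worker_vecs[i], worker_vecs[j])
        den += wq[j]
    return num / den if den else 0.0

vecs = np.array([[0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]])
print(worker_worker_agreement(2, vecs))  # the odd worker out scores 0.0
```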
  42. Worker-Sentence Agreement: measures the agreement between a worker & all the sentences they annotated. AVG(cosine(worker's sentence vector, sentence vector)) ∀ sentences http://crowdtruth.org
  43. Worker-Sentence Agreement (weighted): the same average, weighted by sentence & relation quality http://crowdtruth.org
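A sketch of this agreement together with the worker-quality product from slide 39. Subtracting the worker's own votes from the sentence vector, so that workers are not rewarded for agreeing with themselves, is an assumption of this sketch, and the quality weights are again omitted:

```python
import numpy as np

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return 0.0 if nu == 0.0 or nv == 0.0 else float(u @ v) / (nu * nv)

def worker_sentence_agreement(i, worker_vecs_by_sentence, sentence_vecs):
    """Average, over the sentences worker i annotated, of the cosine between
    worker i's vector and the sentence vector minus worker i's own votes."""
    sims = [cosine(wv[i], sv - wv[i])
            for wv, sv in zip(worker_vecs_by_sentence, sentence_vecs)]
    return float(np.mean(sims))

def worker_quality(wwa_i, wsa_i):
    # the product of the two agreement scores, as defined on slide 39
    return wwa_i * wsa_i
```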
  44. Relation Quality: measures the conditional probability that if one worker picks relation r in a sentence, another worker will also pick it, weighted by sentence & worker quality. Relation Quality(relation r) = P(worker i annotates r in sentence s | worker j annotates r in sentence s), ∀ workers i, j and ∀ sentences s http://crowdtruth.org
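An unweighted estimate of this conditional probability, counting over all sentences and ordered worker pairs (the sentence- and worker-quality weights are left out of the sketch):

```python
import numpy as np

def relation_quality(worker_vecs_per_sentence, r_idx):
    """Estimate P(worker i picks relation r | worker j picked r on the same
    sentence) over all sentences and ordered worker pairs (unweighted)."""
    hits = total = 0
    for vecs in worker_vecs_per_sentence:          # one matrix per sentence
        pickers = [k for k, v in enumerate(vecs) if v[r_idx]]
        for j in pickers:                          # condition: worker j picked r
            for i in range(len(vecs)):
                if i == j:
                    continue
                total += 1
                hits += int(vecs[i][r_idx])        # did worker i also pick r?
    return hits / total if total else 0.0

sents = [np.array([[1, 0], [1, 0], [0, 1]]),
         np.array([[1, 0], [1, 0], [1, 0]])]
print(relation_quality(sents, 0))  # relation 0 is picked consistently -> 0.8
```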
  45. Get the metrics at: https://git.io/v5iTB
  46. CrowdTruth for medical relation extraction in practice • Goals: collect a medical relation extraction gold standard; improve the performance of a relation extraction classifier • Approach: crowdsource 900 medical sentences; measure disagreement with the CrowdTruth metrics; train & evaluate the classifier with the CrowdTruth score http://crowdtruth.org
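One plausible way to "train with the CrowdTruth score" is to threshold the sentence-relation score into labels and reuse it as a per-example confidence weight. The threshold, classifier, and weighting scheme below are assumptions for illustration, not the paper's exact setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_with_crowdtruth(X, scores, threshold=0.5):
    """Sketch: derive labels from the sentence-relation score and weight
    each training example by how clearly the crowd expressed the relation."""
    y = (scores >= threshold).astype(int)
    # clear positives (score near 1) and clear negatives (score near 0)
    # get high weight; ambiguous sentences near the threshold count less
    w = np.where(y == 1, scores, 1.0 - scores)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y, sample_weight=w)
    return clf
```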
  47. Medical RelEx classifier trained with expert vs. CrowdTruth sentence-relation score: F1 0.642 (p = 0.016) vs. 0.638 http://crowdtruth.org
  48. → the crowd provides training data that is at least as good as, if not better than, experts' http://crowdtruth.org
  49. # of Workers: Impact on Sentence-Relation Score [chart] http://crowdtruth.org
  50. # of Workers: Impact on RelEx Model Performance [chart; only 54 sentences had 15 or more workers] http://crowdtruth.org
  51. In summary: CrowdTruth for Medical Relation Extraction • the crowd performs just as well as medical experts • the crowd is also cheaper • the crowd is always available • using only a few annotators for ground truth is faulty • a minimum of 10 workers per sentence is needed for the highest-quality annotations • CrowdTruth is a solution to a Clinical NLP challenge: the lack of ground truth for training & benchmarking http://crowdtruth.org
  52. CrowdTruth • Annotator disagreement is signal, not noise • It is indicative of the variation in human semantic interpretation of signs • It can indicate ambiguity, vagueness, similarity, over-generality, etc., as well as quality http://CrowdTruth.org
  53. crowdtruth.org • metrics: git.io/v5iTB • datasets: data.CrowdTruth.org • @anca_dmtrch
  55. How to model CrowdTruth? (Ogden & Richards, 1923) http://crowdtruth.org
  58. Disagreement is useful. How to model CrowdTruth? http://crowdtruth.org
