
ESWC - PhD Symposium 2016


My presentation during the PhD Symposium at ESWC 2016.

Published in: Data & Analytics

  1. Machine-Crowd Annotation Workflow for Event Understanding across Collections & Domains. Oana Inel. Extended Semantic Web Conference, PhD Symposium, May 30th 2016
  2. Too much information ... e.g., if you are interested in the topic of “whaling”
  3. … and after a while it all looks the same: it is difficult to form a global picture of a topic
  4. … thus, content without context is difficult to process; events can help create context around content
  5. …, but events are not easy to deal with
     • Events are vague
     • Event semantics are difficult
     • Events can be viewed and interpreted from multiple perspectives, e.g. of participant interpretation: The mayor of the city called the celebration a success.
     • Events can be presented at different levels of granularity, e.g. of spatial disagreement: The celebration took place in every city in the Netherlands.
     • People are not consistent in the way they talk about or use events, e.g.: The celebration took place last week, fireworks shows were held everywhere.
  6. … a lot of ground truth is needed to learn event specifics
     • Traditional ground truth collection doesn’t scale:
       • there is not really ‘one type of expert’ when it comes to events
       • the annotation guidelines for events are difficult to define
       • the annotation of events can be a tedious process
       • all of the above can result in high inter-annotator disagreement
     • Crowdsourcing could be an alternative
       • but it is still not a robust & replicable approach
  7. … let’s look at some examples
     According to department policy prosecutors must make a strong showing that lawyers' fees came from assets tainted by illegal profits before any attempts at seizure are made.
     The unit makes intravenous pumps used by hospitals and had more than $110 million in sales last year according to Advanced Medical.
  8. … here is what experts annotate on these sentences
     [According] to department policy prosecutors must make a strong [showing] that lawyers' fees [came] from assets tainted by illegal profits before any [attempts] at [seizure] are [made].
     The unit makes intravenous pumps used by hospitals and [had] more than $110 million in [sales] last year according to Advanced Medical.
  9. … here is what the crowd annotates on them
     According to department policy prosecutors must make a [strong [showing]] that lawyers' fees [[came] from assets] [tainted] by illegal profits before any [attempts] at [seizure] are [made].
     The unit [makes] intravenous pumps [used] by hospitals and [[had] more than $110 million in [sales]] last year according to Advanced Medical.
  10. … here is what the machines can detect
     According to department policy prosecutors must [make] a strong showing that lawyers' fees [came] from assets [tainted] by illegal profits before any attempts at seizure are made.
     The unit [makes] intravenous pumps [used] by hospitals and [had] more than $110 million in sales last year according to Advanced Medical.
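To make the differences between the three annotation layers concrete, here is a minimal illustrative sketch (not part of the original slides) that compares the machine-detected event triggers from slide 10 against the expert and crowd triggers from slides 8 and 9 on the first example sentence; the bracketed spans are simplified to single head words.

    # A minimal sketch (not from the slides): comparing event-trigger sets
    # annotated by experts, the crowd, and a machine on the first example
    # sentence. Spans are reduced to head words for illustration.

    expert  = {"According", "showing", "came", "attempts", "seizure", "made"}
    crowd   = {"showing", "came", "tainted", "attempts", "seizure", "made"}
    machine = {"make", "came", "tainted"}

    def precision_recall(predicted, reference):
        """Precision/recall of a predicted trigger set against a reference set."""
        tp = len(predicted & reference)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(reference) if reference else 0.0
        return precision, recall

    # Evaluate the machine output against the expert and crowd references.
    for name, reference in [("expert", expert), ("crowd", crowd)]:
        p, r = precision_recall(machine, reference)
        print(f"machine vs {name}: precision={p:.2f} recall={r:.2f}")

On this toy example the machine output misses most of the expert and crowd triggers, which is the kind of drawback the crowdsourcing step is meant to compensate for.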
  11. Research Questions
     • Can crowdsourcing help in improving event detection?
     • Can we provide reliable crowdsourced training data?
     • Can we optimize the crowdsourcing process by using results from NLP tools?
     • Can we achieve a replicable data collection process across different data types and use cases?
  12. Current Hypothesis: a disagreement-based approach to crowdsourcing ground truth is reliable and produces quality results
  13. Preliminary Results - Crowd vs. Experts
     ● 200 news snippets from TimeBank
     ● 3019 tweets published in 2014
     ● potentially relevant tweets for events such as ‘whaling’ and ‘Davos 2014’, among others
     The CrowdTruth approach outperforms state-of-the-art crowdsourcing approaches such as single annotator and majority vote.
     The crowd performs almost as well as the experts; the remaining gap is mainly due to the highly specialized linguistic guidelines followed by the expert annotators.
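The contrast between majority vote and a disagreement-based aggregation can be illustrated with a small toy sketch (an assumption-laden example, not the actual CrowdTruth implementation):

    # A minimal sketch (not from the slides) contrasting majority-vote
    # aggregation with a disagreement-preserving score for one candidate
    # event span judged by several crowd workers.

    votes = {"w1": 1, "w2": 1, "w3": 0, "w4": 1, "w5": 0}  # 1 = "is an event"

    # Majority vote collapses the judgements into a single binary label,
    # discarding how contested the span actually was.
    majority_label = int(sum(votes.values()) > len(votes) / 2)

    # A disagreement-based approach (in the spirit of CrowdTruth) keeps a
    # graded score: the fraction of workers selecting the span, which can
    # be used as a soft training signal or thresholded later.
    event_score = sum(votes.values()) / len(votes)

    print(majority_label)  # 1
    print(event_score)     # 0.6 -- the ambiguity is preserved

Keeping the graded score lets ambiguous event mentions contribute a weaker, rather than an all-or-nothing, training signal.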
  14. Current Hypothesis: a disagreement-based approach to crowdsourcing ground truth can be optimised by using results from NLP tools
  15. Preliminary Results - Hybrid Workflow
     Workflow stages (diagram): entity extraction; events crowdsourcing and linking to concepts; segmentation & keyframes; linking events and concepts to keyframes. (diveplus.beeldengeluid.nl)
  16. Preliminary Results - Hybrid Workflow Outcome (diveplus.beeldengeluid.nl)
  17. Approach: Disagreement is Signal
     Principles for disagreement-based crowdsourcing:
     • Do not enforce agreement
     • Capture a multitude of views
     • Take advantage of existing tools, reuse their functionality
     This results in teaching machines to reason in the disagreement space.
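As a rough illustration of how disagreement can be turned into quality scores, the following simplified sketch computes a worker-sentence agreement score as the cosine between a worker's annotation vector and the aggregate of the other workers. The actual CrowdTruth metrics (see crowdtruth.org) are more elaborate and weight units, workers and annotations jointly; this is only a toy approximation.

    import numpy as np

    # Simplified sketch of disagreement-aware quality scores, inspired by
    # (but not identical to) the CrowdTruth metrics. Each worker's judgement
    # on a sentence is a binary vector over candidate event spans.

    worker_vectors = {
        "w1": np.array([1, 1, 0, 1]),
        "w2": np.array([1, 0, 0, 1]),
        "w3": np.array([0, 1, 1, 0]),
    }

    def cosine(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    # Sentence (unit) vector: how often each span was selected across workers.
    unit_vector = sum(worker_vectors.values())

    # Worker-sentence agreement: cosine between a worker's vector and the
    # aggregate of all *other* workers, so outlier workers stand out.
    for worker, vec in worker_vectors.items():
        others = unit_vector - vec
        print(worker, round(cosine(vec, others), 2))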
  18. Overall Methodology
     1. Instantiate the research methodology with specific data and domain
        • video synopses, news
     2. Identify state-of-the-art IE approaches that can be used
        • NER tools for identifying events and their participating entities in the video synopses
     3. Evaluate the IE approaches and identify their drawbacks
        • poor performance in extracting events
     4. Combine IE with crowdsourcing tasks in a complementary way
        • use crowdsourcing for identifying the events and linking them with their participating entities
     5. Evaluate the crowdsourcing results with the CrowdTruth disagreement-first approach
        • evaluate the input units, the workers and the annotations
     6. Instantiate the same workflow with different data and/or a different domain
        • tweets, Twitter
     7. Perform cross-domain analysis
        • event extraction in video synopses vs. event extraction in tweets
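The methodology above can be read as a single machine-crowd loop. The following sketch is purely illustrative: every function name in it (run_ner_tool, publish_crowd_task, crowdtruth_scores, cross_domain_analysis) is a hypothetical placeholder rather than an existing API.

    # Illustrative sketch of the hybrid machine-crowd workflow; all helper
    # names below are hypothetical placeholders, not an actual library.

    def hybrid_event_annotation(documents, domain):
        # Steps 1-3: run state-of-the-art IE tools and keep their (imperfect)
        # output as candidate events and participating entities.
        candidates = [run_ner_tool(doc) for doc in documents]

        # Step 4: complement the machines with crowdsourcing; workers validate
        # and extend the candidate events and link them to participants.
        judgements = publish_crowd_task(documents, candidates)

        # Step 5: evaluate units, workers and annotations with the
        # disagreement-first CrowdTruth approach.
        unit_scores, worker_scores, annotation_scores = crowdtruth_scores(judgements)
        return annotation_scores

    # Steps 6-7: instantiate the same workflow on another collection and
    # compare the resulting annotations across domains, e.g.:
    # video_gt = hybrid_event_annotation(video_synopses, "cultural heritage")
    # tweet_gt = hybrid_event_annotation(tweets, "news")
    # cross_domain_analysis(video_gt, tweet_gt)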
  19. Project Websites: http://CrowdTruth.org, http://diveproject.beeldengeluid.nl
     Tools & Code: http://dev.CrowdTruth.org, http://github.com/CrowdTruth, http://diveplus.beeldengeluid.nl
     Data: http://data.crowdtruth.org, http://data.dive.beeldengeluid.nl
