Crowdsourcing semantic data management: challenges and opportunities
Elena Simperl
Karlsruhe Institute of Technology, Germany
Talk at WIMS 2012: International Conference on Web Intelligence, Mining and Semantics
Craiova, Romania; June 2012




6/18/2012                                      www.insemtives.eu                                    1
Semantic technologies are all about automation
• Many tasks in semantic data management fundamentally rely on human input
  – Modeling a domain
  – Integrating data sources originating from different contexts
  – Producing semantic markup for various types of digital artifacts
  – ...
Great challenges
• Understand what drives users to participate in
  semantic data management tasks
• Design semantic systems reflecting this
  understanding to reach critical mass and
  sustained engagement
Great opportunities
Incentives and motivators
• What motivates people to engage with an application?
• Which rewards are effective and when?

• Motivation is the driving force that makes humans achieve
  their goals
• Incentives are ‘rewards’ assigned by an external ‘judge’ to a
  performer for undertaking a specific task
   – Common belief (among economists): incentives can be
     translated into a sum of money for all practical purposes
• Incentives can be related to extrinsic and intrinsic
  motivations
Incentives and motivators (2)
• Successful volunteer crowdsourcing is difficult to
  predict or replicate
   – Highly context-specific
   – Not applicable to arbitrary tasks


• Reward models often easier to study and control
  (if performance can be reliably measured)
   – Different models: pay-per-time, pay-per-unit, winner-takes-it-all, …
   – Not always easy to abstract from social aspects (free-riding,
     social pressure)
   – May undermine intrinsic motivation
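
The reward models named above can be compared with a little arithmetic. The following is a minimal sketch, not taken from the talk: the `Submission` record, the rates, and the prize are hypothetical values chosen only to show how the same work is paid differently under each model.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    worker: str
    minutes: float  # time spent on the task
    units: int      # accepted work units, e.g. annotations
    score: float    # quality score used to pick a winner


def pay_per_time(subs, rate_per_minute):
    """Pay for time spent, regardless of output."""
    return {s.worker: round(s.minutes * rate_per_minute, 2) for s in subs}


def pay_per_unit(subs, rate_per_unit):
    """Pay for each accepted unit of work."""
    return {s.worker: round(s.units * rate_per_unit, 2) for s in subs}


def winner_takes_all(subs, prize):
    """Give the whole prize to the top scorer (on ties, the first listed)."""
    winner = max(subs, key=lambda s: s.score)
    return {s.worker: (prize if s is winner else 0.0) for s in subs}


# Hypothetical workers and rates, for illustration only.
subs = [Submission("ann", 30, 40, 0.8), Submission("bob", 60, 50, 0.9)]
print(pay_per_time(subs, 0.10))
print(pay_per_unit(subs, 0.05))
print(winner_takes_all(subs, 10.0))
```

All three functions assume performance can be measured reliably, which is the precondition the slide states; none of them models free-riding or social pressure.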
TURN WORK INTO PLAY
GWAPs and gamification
• GWAPs: human computation disguised as casual games
• Gamification/game mechanics: integrate game elements
  into applications
   – Accelerated feedback cycles
       • Annual performance appraisals vs immediate feedback to maintain
         engagement
   – Clear goals and rules of play
       • Players feel empowered to achieve goals vs fuzzy, complex system of
         rules in the real world
   – Compelling narrative
       • Gamification builds a narrative that engages players to participate and
         achieve the goals of the activity
   – But in the end it’s about what tasks users want to get better at
Examples




Example: ontology building




Example: relationship finding
Example: ontology alignment




Example: video annotation




Challenges
• Not all tasks are amenable to gamification
    –   Work is decomposable into simpler (nested) tasks
    –   Performance is measurable according to an obvious rewarding scheme
    –   Skills can be arranged in a smooth learning curve
    –   Player retention vs repetitive tasks
• Not all domains are equally appealing
    – Application domain needs to attract a large user base
    – Knowledge corpus has to be large enough to avoid repetitions
    – Quality of automatically computed input may hamper game experience
• Attracting and retaining players
    – You need a critical mass of players to validate the results
    – Advertisement, building upon an existing user base
    – Continuous development
OUTSOURCING TO THE CROWD
Microtask crowdsourcing
• Work decomposed into small Human Intelligence Tasks
  (HITs) executed independently and in parallel in return
  for a monetary reward.
• Successfully applied to transcription, classification,
  content generation, data collection, image tagging,
  website feedback, usability tests…
• Increasingly used by academia for evaluation purposes
• Extensions for quality assurance, complex workflows,
  resource management, vertical domains…
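
The decomposition into independent HITs and a basic quality-assurance step can be sketched as follows. This is an illustration under assumptions, not a platform API: `make_hits` and `majority_vote` are hypothetical helper names, and majority voting over redundant answers is just one common quality-assurance choice, not one the slides prescribe.

```python
from collections import Counter


def make_hits(items, batch_size):
    """Split work items into independent HITs of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def majority_vote(answers):
    """Return the most frequent answer and the share of workers who gave it."""
    if not answers:
        raise ValueError("at least one answer is required")
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


# Five items batched two per HIT; each HIT can be handed to workers in parallel.
hits = make_hits(["img1", "img2", "img3", "img4", "img5"], 2)
print(hits)

# Three workers answered the same question; the agreement share can serve
# as a rough confidence signal for accepting or re-posting the task.
print(majority_vote(["airport", "airport", "station"]))
```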
Examples




Mason & Watts: Financial incentives and the "performance of crowds", HCOMP 2009.
Crowdsourcing ontology alignment
• Experiments using Amazon's Mechanical Turk and CrowdFlower
  and established benchmarks
• Enhancing the results of automatic techniques
• Fast, accurate and cost-effective
Challenges
• Not all tasks can be addressed by microtask
  platforms
  – Best suited to routine work requiring common knowledge,
    decomposable into simpler, independent sub-tasks,
    with easily measurable performance
• Ongoing research in task design, quality
  assurance (spam), estimated time of
  completion…
Crowdsourcing query processing
Give me the German names of all commercial airports in Baden-Württemberg,
ordered by their most informative description.

"Retrieve the labels in German of commercial airports located
in Baden-Württemberg, ordered by the better human-readable
description of the airport given in the comment".


• This query cannot be optimally answered automatically
   – Incorrect/missing classification of entities (e.g. classification as airports
     instead of commercial airports)
   – Missing information in data sets (e.g. German labels)
   – It is not possible to optimally perform subjective operations (e.g. comparisons
     of pictures or NL comments)
An integrated solution
• Integral part of Linked Data
  management platforms
    – At design time, the application
      developer specifies which data
      portions workers can process and
      via which types of HITs
    – At run time
        • The system materializes the data
        • Workers process it
        • Data and application are updated to
          reflect crowdsourcing results
• Formal, declarative description of
  the data and tasks using SPARQL
  patterns as a basis for the
  automatic design of HITs
• Reducing the number of tasks
  through automatic reasoning
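
The run-time loop described above (materialize, let workers process, write results back) can be sketched over an in-memory set of triples. This is a simplified illustration, not the system from the talk: the selection rule "subjects lacking a given predicate", the `ex:` names, and the `ask_worker` callback are all assumptions standing in for the design-time specification and the crowdsourcing platform.

```python
def materialize(triples, predicate):
    """Assumed design-time rule: select subjects that lack `predicate`."""
    subjects = {s for s, _, _ in triples}
    covered = {s for s, p, _ in triples if p == predicate}
    return sorted(subjects - covered)


def run_crowd_round(triples, predicate, ask_worker):
    """Materialize tasks, collect worker answers, write them back as triples."""
    updated = set(triples)
    for subject in materialize(triples, predicate):
        answer = ask_worker(subject, predicate)
        if answer is not None:  # workers may skip a task
            updated.add((subject, predicate, answer))
    return updated


# Hypothetical data: one entity already has a German label, one does not.
data = {
    ("ex:airport1", "rdfs:comment", "An airport near Stuttgart"),
    ("ex:airport2", "ex:labelDe", "Flughafen Karlsruhe/Baden-Baden"),
}


def fake_worker(subject, predicate):
    """Stand-in for a real HIT; returns a canned answer or None."""
    return {"ex:airport1": "Flughafen Stuttgart"}.get(subject)


result = run_crowd_round(data, "ex:labelDe", fake_worker)
print(sorted(result))
```

Only entities with missing information become tasks here, which is one simple way to keep the number of HITs down; the slides propose automatic reasoning for that purpose.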
Example using SPARQL
"Retrieve the labels in German of commercial airports
located in Baden-Württemberg, ordered by the better
human-readable description of the airport given in the
comment".

SPARQL Query:
SELECT ?label WHERE {
  ?x a metar:CommercialHubAirport;        # 1 Classification
    rdfs:label ?label;
    rdfs:comment ?comment .
  ?x geonames:parentFeature ?z .
  ?z owl:sameAs <http://dbpedia.org/resource/Baden-Wuerttemberg> .   # 2 Identity resolution
  FILTER (LANG(?label) = "de")            # 3 Missing information
} ORDER BY CROWD(?comment, "Better description of %x")   # 4 Ordering
HITs design: Classification
  • It is not always possible to automatically infer classification
    from the properties.
  • Example: Retrieve the names (labels) of METAR stations that correspond to
     commercial airports.
SELECT ?label WHERE {
  ?station a metar:CommercialHubAirport;
    rdfs:label ?label .
}

Input:  {?station a metar:Station;
           rdfs:label ?label;
           wgs84:lat ?lat;
           wgs84:long ?long}

Output: {?station a ?type .
         ?type rdfs:subClassOf metar:Station}
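
The Input and Output patterns above can be read as a recipe for generating a classification HIT. The sketch below is one possible reading, not the authors' implementation: the function names, the HIT dictionary layout, the example binding, and the assumption that both candidate types are subclasses of `metar:Station` are hypothetical.

```python
def build_classification_hit(binding, candidate_types):
    """Turn one solution of the Input pattern into a multiple-choice HIT."""
    return {
        "question": (
            f"Which type best describes '{binding['label']}' "
            f"(lat {binding['lat']}, long {binding['long']})?"
        ),
        "subject": binding["station"],
        "options": list(candidate_types),
    }


def answer_to_triple(hit, chosen_type):
    """Map a worker's choice to the Output pattern: ?station a ?type."""
    if chosen_type not in hit["options"]:
        raise ValueError(f"{chosen_type} is not an allowed type for this HIT")
    return (hit["subject"], "rdf:type", chosen_type)


# Hypothetical binding for ?station, ?label, ?lat, ?long.
binding = {"station": "ex:station1", "label": "Stuttgart", "lat": 48.69, "long": 9.22}
# Assumed subclasses of metar:Station, as the Output pattern requires.
types = ["metar:CommercialHubAirport", "metar:Airport"]

hit = build_classification_hit(binding, types)
print(hit["question"])
print(answer_to_triple(hit, "metar:CommercialHubAirport"))
```

Restricting answers to the listed options mirrors the `rdfs:subClassOf` constraint in the Output pattern, so a worker cannot introduce a type outside the expected hierarchy.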
HITs design: Ordering
  • Orderings defined via less straightforward built-ins; for instance,
    the ordering of pictorial representations of entities.
  • SPARQL extension: ORDER BY CROWD
  • Example: Retrieve all airports and their pictures, with the pictures
         ordered by how representative each image is of the given airport.

SELECT ?airport ?picture WHERE {
  ?airport a metar:Airport;
    foaf:depiction ?picture .
} ORDER BY CROWD(?picture,
"Most representative image for %airport")

Input:     {?airport foaf:depiction ?x, ?y}

Output: {{(?x ?y) a rdf:List} UNION {(?y ?x) a rdf:List}}
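
The Output pattern yields pairwise judgments, which still have to be combined into a total order. The slides do not say how; the sketch below uses simple win counting as one possible aggregation, and the function names and picture file names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_tasks(items):
    """One comparison HIT per unordered pair of items."""
    return list(combinations(items, 2))


def order_by_crowd(items, judgments):
    """Rank items by pairwise wins; judgments are (better, worse) pairs."""
    wins = defaultdict(int)
    for better, _worse in judgments:
        wins[better] += 1
    # sorted() is stable, so tied items keep their original relative order.
    return sorted(items, key=lambda item: -wins[item])


pictures = ["a.jpg", "b.jpg", "c.jpg"]
print(pairwise_tasks(pictures))

# Hypothetical worker judgments, one per pair: (preferred, other).
judgments = [("b.jpg", "a.jpg"), ("c.jpg", "a.jpg"), ("b.jpg", "c.jpg")]
print(order_by_crowd(pictures, judgments))
```

Comparing every pair needs n(n-1)/2 HITs for n items, so the cost grows quadratically; this is one reason the pricing and task-reduction questions raised in the talk matter.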
Challenges
• Decomposition of queries
    – Query optimisation obfuscates what is used, and should account for the
      costs of human tasks
• Query execution and caching
    – Naively we can materialise HIT results into datasets
    – How to deal with partial coverage and dynamic datasets
• Appropriate level of granularity for HITs design for specific SPARQL
  constructs and typical functionality of Linked Data management
  components
• Optimal user interfaces for graph-like content
    – (Contextual) Rendering of LOD entities and tasks
• Pricing and workers’ assignment
    – Can we connect the end-users of an application and their wish for
      specific data to be consumed with the payment of workers and
      prioritization of HITs?
    – Dealing with spam / gaming
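
For the query execution and caching challenge, a naive materialisation of HIT results could look like the sketch below. It is only an illustration of the open questions, not a proposed solution: `HitCache`, the per-item version counter used to detect changed data, and the example identifiers are assumptions.

```python
class HitCache:
    """Stores crowd answers per (task, item), tagged with the item's data version."""

    def __init__(self):
        self._results = {}

    def put(self, task, item, version, answer):
        self._results[(task, item)] = (version, answer)

    def get(self, task, item, version):
        """Return the cached answer, or None if absent or from another version."""
        entry = self._results.get((task, item))
        if entry is None or entry[0] != version:
            return None
        return entry[1]

    def missing(self, task, versions):
        """Items (from an item -> version mapping) that still need a HIT."""
        return [i for i, v in versions.items() if self.get(task, i, v) is None]


cache = HitCache()
cache.put("classify", "ex:station1", 1, "metar:CommercialHubAirport")

# Partial coverage: ex:station2 has never been sent to the crowd.
print(cache.missing("classify", {"ex:station1": 1, "ex:station2": 1}))

# Dynamic dataset: ex:station1 changed (version 2), so its answer is stale.
print(cache.missing("classify", {"ex:station1": 2, "ex:station2": 1}))
```

Treating any change as invalidating the answer is deliberately conservative; deciding when a crowd answer survives an update is part of the challenge listed above.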
Thank you
  e: elena.simperl@kit.edu, t: @esimperl

Publications available at www.insemtives.org

   Team: Maribel Acosta, Barry Norton, Katharina Siorpaes,
       Stefan Thaler, Stephan Wölger and many others
