
Human Computation for Big Data


Over the last few years we have observed the emergence of hybrid human-machine information systems that can both scale to large amounts of data and maintain the high-quality data processing intrinsic to human intelligence.

In this talk I will focus on the use of human intelligence at scale, by means of crowdsourcing, to deal with Big Data problems. We will look specifically at how Human Computation can handle the variety in data while still operating on a large data volume.

First, I will introduce the area of micro-task crowdsourcing and give an overview of the research challenges that need to be tackled to enable large-scale hybrid human-machine information systems. Next, I will provide examples of such hybrid systems for entity linking and disambiguation that use crowdsourcing and a graph of linked entities as a background corpus. I will describe how keyword query understanding can be crowdsourced to build search engines that can answer rare, complex queries. Finally, I will present new techniques that improve the quality of crowdsourced information system components by means of push crowdsourcing.

Published in: Science, Technology, Education

    1. Human Computation for Big Data
       Gianluca Demartini, eXascale Infolab, University of Fribourg, Switzerland
       gianlucademartini.net, exascale.info
       CUSO Seminar on Big Data, May 23, 2014, Fribourg
    2. Gianluca Demartini (demartini@exascale.info)
       • M.Sc. at University of Udine, Italy
       • Ph.D. at University of Hannover, Germany
         – Entity Retrieval
       • Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany)
       • Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland
       • Lecturer for Social Computing in Fribourg
       • Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013
       • Research Interests
         – Information Retrieval, Semantic Web, Human Computation
    3. Web of Data
       • Freebase
         – Acquired by Google in July 2010.
         – Knowledge Graph launched in May 2012.
       • Schema.org
         – Driven by major search engine companies
         – Machine-readable annotations of Web pages
       • Linked Open Data
         – 31 billion triples, Sept. 2011
       • Volume and Variety
    4. Linked Open Data (Z. Kaoudi and I. Manolescu, ICDE seminar 2013)
    5. LOD data is an enormous graph
       • Subject – Predicate – Object
         – Barack Obama – marriedTo – Michelle Obama
       • Specific scalable DB systems exist
       [Figure: graph of entities e1, e2, e3, e4 connected by predicates p1, p2, p3]
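The subject, predicate, object model on slide 5 can be illustrated with a minimal in-memory sketch. Only the `marriedTo` triple comes from the slide; the other triples and the helper names `build_graph` and `objects` are hypothetical, and a real deployment would use one of the scalable triple stores the slide alludes to.

```python
from collections import defaultdict

# Toy data: the first triple is from the slide, the others are made up.
triples = [
    ("Barack Obama", "marriedTo", "Michelle Obama"),
    ("Barack Obama", "bornIn", "Honolulu"),
    ("Honolulu", "locatedIn", "Hawaii"),
]


def build_graph(triples):
    """Index triples as subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def objects(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [o for p, o in graph.get(subject, []) if p == predicate]


graph = build_graph(triples)
spouses = objects(graph, "Barack Obama", "marriedTo")
```

Each entity is a node and each predicate a labelled edge, which is why the whole dataset can be read as one large graph.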
    6. I will talk about
       • Micro-task Crowdsourcing
       • Hybrid Human-Machine systems
       • Entity Linking/Disambiguation
         – On the Web using crowdsourcing
       • Improving Crowdsourcing Platform Quality
         – Pushing tasks to workers
       • Research directions
         – Crowdsourced Query Understanding
         – Transactive Search
    7. Crowdsourcing
       • Exploit human intelligence to solve
         – Tasks simple for humans, complex for machines
         – With a large number of humans (the Crowd)
         – Small problems: micro-tasks (Amazon MTurk)
       • Examples
         – Wikipedia, Image tagging
       • Incentives
         – Financial, fun, visibility
    8. Case-Study: Amazon MTurk
       • Micro-task crowdsourcing marketplace
       • On-demand, scalable, real-time workforce
       • Different crowd motivation (not just money)
       • Online since 2005 (still in “beta”)
       • Currently the most popular platform
       • Developer’s API as well as GUI
    9. Amazon MTurk
    10. A Task on MTurk
    11. Amazon MTurk Workflow
        • Requesters create tasks (HITs)
        • Workers preview, accept, submit HITs
        • Requesters approve, download results
    12. Example: Hybrid Image Search
        Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones, Mobisys 2010.
    13. Example: Hybrid Data Integration
        Table with attributes paper, conf:
          paper            | conf
          Data integration | VLDB-01
          Data mining      | SIGMOD-02
        Table with attributes title, author, email, venue:
          title        | author | email  | venue
          OLAP         | Mike   | mike@a | ICDE-02
          Social media | Jane   | jane@b | PODS-05
        • Generate plausible matches
          – paper = title, paper = author, paper = email, paper = venue
          – conf = title, conf = author, conf = email, conf = venue
        • Ask users to verify
          – “Does attribute paper match attribute author?” (Yes / No / Not sure)
        McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008
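The two steps on slide 13 (generate plausible matches, then ask users to verify) can be sketched as follows. This is an illustrative simplification, not the method of the cited paper: the function names, the vote format, and the majority threshold `min_yes` are all assumptions.

```python
from itertools import product


def candidate_matches(source_attrs, target_attrs):
    """Machine step: every source attribute may match every target attribute."""
    return list(product(source_attrs, target_attrs))


def verify(candidates, votes, min_yes=0.5):
    """Human step: keep a candidate when more than `min_yes` of its votes are yes.

    `votes` maps a candidate pair to a list of booleans (hypothetical format);
    candidates nobody voted on are left out.
    """
    accepted = []
    for pair in candidates:
        answers = votes.get(pair, [])
        if answers and sum(answers) / len(answers) > min_yes:
            accepted.append(pair)
    return accepted


candidates = candidate_matches(["paper", "conf"], ["title", "author", "email", "venue"])
votes = {
    ("paper", "title"): [True, True, False],
    ("paper", "author"): [False, False, True],
    ("conf", "venue"): [True, True, True],
}
accepted = verify(candidates, votes)
```

With two source attributes and four target attributes the machine step yields the eight candidates listed on the slide; the crowd only has to answer yes/no questions about them.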
    14. Hybrid Systems: Key Issues
        • The role of machine (i.e., algorithm) and humans
          – use only humans? both? who’s doing what?
        • Quality control
        • Payment
        • Optimization: What to crowdsource
        • Scalability: How much to crowdsource
    15. Entity Linking/Disambiguation
    16. RDFa enrichment
        HTML:
        <p>Facebook is not waiting for its initial public offering to make its first big purchase.</p><p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
        After RDFa enrichment:
        <p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p><p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
        [Diagram labels: http://dbpedia.org/resource/Facebook, http://dbpedia.org/resource/Instagram, fbase:Instagram, owl:sameAs, Google, Android]
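The RDFa enrichment on slide 16 can be approximated with a small string-based sketch. It assumes the entity mention and its target URI are already known, and it wraps only the mention (the slide wraps the whole sentence in the `span`); `rdfa_annotate` is a hypothetical helper, not part of ZenCrowd.

```python
from html import escape


def rdfa_annotate(text, mention, uri):
    """Wrap the first occurrence of `mention` in RDFa markup pointing at `uri`."""
    start = text.find(mention)
    if start < 0:
        # Nothing to link: return the text unchanged (but HTML-escaped).
        return escape(text)
    end = start + len(mention)
    markup = (
        f'<span about="{escape(uri, quote=True)}">'
        f'<cite property="rdfs:label">{escape(mention)}</cite>'
        "</span>"
    )
    return escape(text[:start]) + markup + escape(text[end:])


annotated = rdfa_annotate(
    "Facebook is not waiting.",
    "Facebook",
    "http://dbpedia.org/resource/Facebook",
)
```

A production pipeline would operate on a parsed DOM rather than raw strings, so that existing markup inside the sentence is preserved.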
    17. ZenCrowd
        • Combine both algorithmic and manual linking
        • Automate manual linking via crowdsourcing
        • Dynamically assess human workers with a probabilistic reasoning framework
        [Diagram labels: Crowd, Algorithms, Machines]
    18. ZenCrowd Architecture
        [Architecture diagram labels: HTML Pages (Input), Entity Extractors, LOD Index (Get Entity), LOD Open Data Cloud, Algorithmic Matchers, Probabilistic Network, Decision Engine, Micro-Task Manager, Micro Matching Tasks, Crowdsourcing Platform, Workers Decisions, HTML + RDFa Pages (Output)]
        Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
    19. Entity Factor Graphs
        • Graph components
          – Workers, links, clicks
          – Prior probabilities
          – Link Factors
          – Constraints
        • Probabilistic Inference
          – Select all links with posterior prob > τ
        [Figure: example factor graph with 2 workers (w1, w2), 6 clicks (c11 to c23), and 3 candidate links (l1, l2, l3); legend: link priors pl(), worker priors pw(), observed variables, link factors lf(), SameAs constraints sa1-2(), dataset unicity constraints u2-3()]
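The selection rule on slide 19 (keep every link whose posterior exceeds τ) can be illustrated with a much simpler model than the factor graph itself. The sketch below is a naive-Bayes stand-in, not ZenCrowd's inference: it assumes each worker answers correctly with a fixed probability (the worker prior), treats votes as independent, and ignores the SameAs and unicity constraints. All names and numbers are hypothetical.

```python
def link_posterior(prior, votes):
    """Posterior that a candidate link is correct.

    `prior` is the link prior; `votes` is a list of
    (worker_reliability, said_yes) pairs, where reliability is the
    assumed probability that the worker answers correctly.
    """
    p_true, p_false = prior, 1.0 - prior
    for reliability, said_yes in votes:
        if said_yes:
            p_true *= reliability
            p_false *= 1.0 - reliability
        else:
            p_true *= 1.0 - reliability
            p_false *= reliability
    total = p_true + p_false
    return p_true / total if total > 0 else 0.0


def select_links(candidates, tau=0.5):
    """Keep the links whose posterior is strictly above the threshold tau."""
    return [
        link for link, prior, votes in candidates
        if link_posterior(prior, votes) > tau
    ]


candidates = [
    ("l1", 0.5, [(0.8, True), (0.6, True)]),
    ("l2", 0.5, [(0.8, False), (0.6, True)]),
]
selected = select_links(candidates, tau=0.5)
```

Here the more reliable worker outweighs the less reliable one on `l2`, which is the intuition behind assessing workers dynamically instead of using a plain majority vote.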
    20. Experimental Evaluation
        • Datasets
          – 25 news articles from
            • CNN.com (Global news)
            • NYTimes.com (Global news)
            • Washington-post.com (US local news)
            • Timesofindia.indiatimes.com (India news)
            • Swissinfo.com (Switzerland local news)
          – 40M entities (Freebase, DBPedia, Geonames, NYT)
    21. Worker Selection
        [Figure: worker precision vs. number of tasks for US workers and IN workers, with the top US worker marked; precision vs. top K workers]
    22. Lessons Learnt
        • Crowdsourcing + Prob reasoning works!
        • But
          – Different worker communities perform differently
          – Many low quality workers
          – Completion time may vary (based on reward)
        • Need to find the right workers for your task (see WWW13 paper)
    23. ZenCrowd Summary
        • ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
        • Standard crowdsourcing improves 6% over automatic
        • 4% - 35% improvement over standard crowdsourcing
        • 14% average improvement over automatic approaches
        http://exascale.info/zencrowd/
    24. Blocking for Instance Matching
        • Find the instances about the same real-world entity within two datasets
        • Avoid Comparison of all possible pairs
          – Step 1: cluster similar items using a cheap similarity measure
          – Step 2: n*n comparison within the clusters with an expensive measure
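The two blocking steps on slide 24 can be sketched in a few lines. The choices below are assumptions made for illustration only: the cheap measure is a three-letter prefix key, the expensive measure is `difflib.SequenceMatcher`, and the threshold of 0.8 is arbitrary.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name):
    """Cheap measure: group records by their first three letters."""
    return name.lower().replace(" ", "")[:3]


def expensive_similarity(a, b):
    """Expensive measure: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_within_blocks(left, right, threshold=0.8):
    """Compare only the left/right records that fall in the same block."""
    blocks = defaultdict(lambda: ([], []))
    for item in left:
        blocks[block_key(item)][0].append(item)
    for item in right:
        blocks[block_key(item)][1].append(item)
    matches = []
    for lefts, rights in blocks.values():
        for a in lefts:
            for b in rights:
                if expensive_similarity(a, b) >= threshold:
                    matches.append((a, b))
    return matches


left = ["Barack Obama", "Michelle Obama"]
right = ["Barack H. Obama", "Michele Obama", "Angela Merkel"]
matches = match_within_blocks(left, right)
```

In this toy run only two of the six possible cross-dataset pairs are scored with the expensive measure, because the other four fall into different blocks.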
    25. Three-stage blocking with the Crowd for Data Integration
        • 1. Cheap clustering/inverted index selection of candidates
        • 2. Expensive similarity measure
        • 3. Crowdsource low confidence matches
        Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22, Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the Web. October 2013.
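The third stage on slide 25 amounts to routing candidate pairs by confidence. The sketch below assumes the expensive measure from stage 2 has already produced a score per pair; the two thresholds and the function name `route_pairs` are hypothetical and would need tuning on real data.

```python
def route_pairs(scored_pairs, accept_at=0.9, reject_at=0.5):
    """Split scored candidate pairs into automatic matches and crowd tasks.

    Pairs scoring at least `accept_at` are accepted automatically, pairs
    below `reject_at` are dropped, and everything in between is sent to
    the crowd as a low-confidence match.
    """
    automatic, crowd = [], []
    for pair, score in scored_pairs:
        if score >= accept_at:
            automatic.append(pair)
        elif score >= reject_at:
            crowd.append(pair)
    return automatic, crowd


scored = [
    (("a1", "b1"), 0.95),
    (("a2", "b2"), 0.70),
    (("a3", "b3"), 0.20),
]
automatic, crowd = route_pairs(scored)
```

Narrowing the band between the two thresholds reduces crowdsourcing cost at the price of more automatic decisions on uncertain pairs.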
    26. Improving Crowdsourcing Platforms
    27. Pull (Traditional) Crowdsourcing
        • In MTurk HITs are published on the market
        • The first worker willing to do it can take it
        • Pro: Fast
        • Con: Not necessarily optimal / not the best worker for the task
    28. Push Crowdsourcing
        • Pick-A-Crowd: A system architecture that uses Task-to-Worker matching:
          – The worker’s social profile
          – The task context
        • Workers can provide higher quality answers on tasks they relate to
        Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.
    29. Matching Models: Expert Finding
        • Build an inverted index on the pages’ titles and descriptions
        • Use the title/description of the task as a keyword query on the inverted index and get a subset of pages
        • Rank the workers by the number of liked pages in the subset
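The three expert-finding steps on slide 29 can be sketched directly. This is a minimal illustration, assuming whitespace tokenisation, no stop-word removal or ranking function, and made-up page and worker data; it is not the Pick-A-Crowd implementation.

```python
from collections import defaultdict


def build_index(pages):
    """Inverted index: term -> set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def rank_workers(task_text, index, likes):
    """Rank workers by how many of their liked pages match the task text."""
    relevant = set()
    for term in task_text.lower().split():
        relevant |= index.get(term, set())
    scores = {worker: len(liked & relevant) for worker, liked in likes.items()}
    matching = [w for w in scores if scores[w] > 0]
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(matching, key=lambda w: (-scores[w], w))


pages = {
    "p1": "Football club news",
    "p2": "Italian cooking recipes",
    "p3": "World cup football",
}
likes = {"alice": {"p1", "p3"}, "bob": {"p2"}, "carol": {"p3"}}
ranking = rank_workers("Identify the football player", build_index(pages), likes)
```

Workers with no liked page in the retrieved subset are dropped, so the task is pushed only to people whose profile relates to it.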
    30. Pick-A-Crowd
    31. Discussion
        • Pull vs. Push methodologies in Crowdsourcing
        • Pick-A-Crowd system architecture with Task-to-Worker recommendation
        • Experimental comparison with AMT shows a consistent quality improvement
        “Workers Know what they Like”
        www.openturk.com
    32. OpenTurk
        • Yet another platform? Build on top of MTurk!
        • Chrome Extension for push / notification
        • 400+ users
        • http://bit.ly/openturk-extension
        • Open source: https://github.com/openturk/extension
    33. CrowdQ: Crowdsourced Query Understanding
    34. birthdate of the mayor of the capital city of italy
    35. capital city of italy
    36. mayor of rome
    37. birthdate of ignazio marino
    38. Motivation
        • Web Search Engines can answer simple factual queries directly on the result page
        • Users with complex information needs are often unsatisfied
        • Purely automatic techniques are not enough
        • We want to solve it with Crowdsourcing!
    39. CrowdQ
        • CrowdQ is the first system that uses crowdsourcing to
          – Understand the intended meaning
          – Build a structured query template
          – Answer the query over Linked Open Data
        Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
    40. Hybrid Human-Machine Pipeline
        Q = birthdate of actors of forrest gump
        • Query annotation: Noun, Noun, Named entity
        • Verification: Is forrest gump this entity in the query?
        • Entity relations: Which is the relation between actors and forrest gump? Answer: starring
        • Schema element: Starring, <dbpedia-owl:starring>
        • Verification: Is the relation between Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors?
    41. Structured query generation
        Q = birthdate of actors of forrest gump
        SELECT ?y ?x
        WHERE { ?y <dbpedia-owl:birthdate> ?x .
                ?z <dbpedia-owl:starring> ?y .
                ?z <rdfs:label> 'Forrest Gump' }
        Results from BTC09:
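The query on slide 41 has the shape of a template with three slots: the property being asked for, the relation verified by the crowd, and the entity label. The sketch below fills such a template; `build_query` is a hypothetical helper, and the identifiers are copied as written on the slide (a real endpoint would likely need `PREFIX` declarations and prefixed names without angle brackets).

```python
def build_query(value_property, relation, entity_label):
    """Fill the three-pattern query template with crowd-verified elements."""
    # Escape backslashes and single quotes so the label stays one literal.
    label = entity_label.replace("\\", "\\\\").replace("'", "\\'")
    return "\n".join([
        "SELECT ?y ?x",
        "WHERE {",
        f"  ?y <{value_property}> ?x .",
        f"  ?z <{relation}> ?y .",
        f"  ?z <rdfs:label> '{label}'",
        "}",
    ])


query = build_query("dbpedia-owl:birthdate", "dbpedia-owl:starring", "Forrest Gump")
```

Once a template has been verified for one query, it could be reused for queries of the same form by swapping only the entity label.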
    42. Transactive Search
    43. Transactive Search
        • What if the data to answer your query is not stored on any digital support?
        • What if the data is just in people's minds?
        • Big Data No Data
    44. Transactive Search
        • Search using Transactive (group) Memories
        • “Who attended the WWW 2014 conference?”
        • Machines: Harvest the Web + Data Mining
        • Crowd: Search Twitter, look at event pictures
        • Transactive Memories: Remember who I met
        Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search. In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul, South Korea, April 2014.
    45. Transactive Search (2)
    46. Transactive Search (3)
    47. Discussion
        • Sometimes data is not on the Web
        • The right group of people can still answer
          – Collaboratively
          – Using Transactive Search
          – Better than machines or anonymous crowds
        • Open challenges
          – Incentives
          – Repeatability
          – SNA
    48. Research Directions for Micro-task Crowdsourcing
    49. State of Micro-task Crowdsourcing
        • Platform side
          – Pull platforms
          – Batch processing
        • Worker side
          – Work flexibility
          – Anonymity
        • Requester side
          – Web/API
    50. The Future for Requesters
        • Push Platforms
          – RecSys, User Modeling, Trust
        • Mobile Access
        • Quality and Time guarantees
        • Worker API (enable novel worker UI)
    51. The Future of the Worker side
        • Reputation system for workers
        • More than financial incentives
        • Recognize worker potential (badges)
          – Paid for their expertise
        • Train less skilled workers (tutoring system)
        Aniket Kittur et al. The Future of Crowd Work. CSCW 2013.
    52. Crowdsourcing Ethics
        • People work full-time as crowd workers
        • Chinese crowdsourcing platform with 5.5M workers
        • Pros
          – Help developing countries
          – Provide cash fast to people == short-term satisfaction
          – Job Flexibility
        • Cons
          – No job security
          – No social security
          – Long term satisfaction? Career plans?
        Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”, September 2013.
    53. Conclusions
        • Structured Data makes the Web better
        • It’s growing fast
          – Large volume
          – Large heterogeneity
        • Crowds can help understand data semantics
        • Hybrid human-machine systems (ZenCrowd)
        • Research opportunities:
          – Exploit Human Intelligence at Scale (CrowdQ)
          – Pick the right crowd (Pick-A-Crowd, Transactive Search)
        gianlucademartini.net, demartini@exascale.info
