Human Computation for Big Data
Gianluca Demartini
eXascale Infolab
University of Fribourg, Switzerland
gianlucademartini.net
exascale.info
CUSO Seminar on Big Data – May 23, 2014 – Fribourg
Gianluca Demartini
• M.Sc. at University of Udine, Italy
• Ph.D. at University of Hannover, Germany
– Entity Retrieval
• Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research
(Spain), L3S Research Center (Germany)
• Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland.
• Lecturer for Social Computing in Fribourg
• Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at
ESWC 2013 and ISWC 2013
• Research Interests
– Information Retrieval, Semantic Web, Human Computation
demartini@exascale.info
Web of Data
• Freebase
– Acquired by Google in July 2010.
– Knowledge Graph launched in May 2012.
• Schema.org
– Driven by major search engine companies
– Machine-readable annotations of Web pages
• Linked Open Data
– 31 billion triples, Sept. 2011
• Volume and Variety
Linked Open Data
Z. Kaoudi and I. Manolescu, ICDE seminar 2013
LOD data is an enormous graph
• Subject – Predicate – Object
– Barack Obama – marriedTo – Michelle Obama
• Specific scalable DB systems exist
[Figure: example graph in which entities e1, e2, e3, and e4 are nodes connected by predicate edges p1, p2, and p3]
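As a rough illustration of the Subject, Predicate, Object model above, the following minimal sketch stores triples in memory and answers simple lookups. The `TripleStore` class and the `bornIn` triple are assumptions made for this example; this is not one of the scalable DB systems the slide refers to.

```python
from collections import defaultdict


class TripleStore:
    """Tiny in-memory store for (subject, predicate, object) triples."""

    def __init__(self):
        # Two indexes so that lookups from either end of an edge are cheap.
        self.by_subject = defaultdict(set)
        self.by_object = defaultdict(set)

    def add(self, s, p, o):
        self.by_subject[s].add((p, o))
        self.by_object[o].add((s, p))

    def objects(self, s, p):
        """All objects o such that (s, p, o) is in the store."""
        return sorted(o for (pred, o) in self.by_subject[s] if pred == p)

    def subjects(self, p, o):
        """All subjects s such that (s, p, o) is in the store."""
        return sorted(s for (s, pred) in self.by_object[o] if pred == p)


store = TripleStore()
store.add("Barack Obama", "marriedTo", "Michelle Obama")
store.add("Barack Obama", "bornIn", "Honolulu")  # extra illustrative triple

spouses = store.objects("Barack Obama", "marriedTo")
```

Each triple is one labeled edge of the graph, so the two indexes correspond to following an edge forwards or backwards.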
I will talk about
• Micro-task Crowdsourcing
• Hybrid Human-Machine systems
• Entity Linking/Disambiguation
– On the Web using crowdsourcing
• Improving Crowdsourcing Platform Quality
– Pushing tasks to workers
• Research directions
– Crowdsourced Query Understanding
– Transactive Search
Crowdsourcing
• Exploit human intelligence to solve
– Tasks simple for humans, complex for machines
– With a large number of humans (the Crowd)
– Small problems: micro-tasks (Amazon MTurk)
• Examples
– Wikipedia, Image tagging
• Incentives
– Financial, fun, visibility
Case-Study: Amazon MTurk
• Micro-task crowdsourcing marketplace
• On-demand, scalable, real-time workforce
• Different crowd motivation (not just money)
• Online since 2005 (still in “beta”)
• Currently the most popular platform
• Developer’s API as well as GUI
Amazon MTurk
A Task on MTurk
Amazon MTurk Workflow
• Requesters create tasks (HITs)
• Workers preview, accept, submit HITs
• Requesters approve, download results
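The workflow above can be read as a small state machine over a HIT. The sketch below is only a model of that lifecycle: the `HitState` names and the `advance` helper are hypothetical, it does not call the MTurk API, and it leaves out the preview step because previewing does not change the state of a HIT.

```python
from enum import Enum


class HitState(Enum):
    PUBLISHED = "published"   # the requester has created the HIT
    ACCEPTED = "accepted"     # a worker has accepted it
    SUBMITTED = "submitted"   # the worker has submitted an answer
    APPROVED = "approved"     # the requester has approved the result


# Allowed (state, action) pairs and the state they lead to.
TRANSITIONS = {
    (HitState.PUBLISHED, "accept"): HitState.ACCEPTED,
    (HitState.ACCEPTED, "submit"): HitState.SUBMITTED,
    (HitState.SUBMITTED, "approve"): HitState.APPROVED,
}


def advance(state, action):
    """Return the next state, or raise if the action is not allowed."""
    key = (state, action)
    if key not in TRANSITIONS:
        raise ValueError(f"cannot {action} a HIT that is {state.value}")
    return TRANSITIONS[key]


state = HitState.PUBLISHED
for action in ("accept", "submit", "approve"):
    state = advance(state, action)
```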
Example: Hybrid Image Search
Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image
Search on Mobile Phones, Mobisys 2010.
Example: Hybrid Data Integration
paper              conf
Data integration   VLDB-01
Data mining        SIGMOD-02

title          author   email
OLAP           Mike     mike@a
Social media   Jane     jane@b

• Generate plausible matches
– paper = title, paper = author, paper = email, paper = venue
– conf = title, conf = author, conf = email, conf = venue
• Ask users to verify

paper              conf
Data integration   VLDB-01
Data mining        SIGMOD-02

title          author   email    venue
OLAP           Mike     mike@a   ICDE-02
Social media   Jane     jane@b   PODS-05

Does attribute paper match attribute author?
Yes / No
McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008
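The two steps above (generate plausible matches, then ask users to verify) can be sketched as follows. This is a minimal illustration under assumptions: the function names, the majority-vote rule, and the `min_votes` parameter are choices made for this example and are not taken from the cited paper.

```python
from itertools import product


def candidate_matches(source_attrs, target_attrs):
    """Step 1: enumerate every plausible attribute pair."""
    return list(product(source_attrs, target_attrs))


def verify(votes, min_votes=3):
    """Step 2: keep pairs that a strict majority of users answered Yes to.

    votes maps (source_attr, target_attr) to a list of True/False answers.
    Pairs with fewer than min_votes answers stay undecided and are skipped.
    """
    accepted = []
    for pair, answers in votes.items():
        if len(answers) < min_votes:
            continue
        if sum(answers) * 2 > len(answers):
            accepted.append(pair)
    return accepted


candidates = candidate_matches(
    ["paper", "conf"], ["title", "author", "email", "venue"]
)

# Hypothetical answers to "Does attribute X match attribute Y?"
votes = {
    ("paper", "title"): [True, True, False],
    ("paper", "author"): [False, False, True],
    ("conf", "venue"): [True, True, True],
    ("conf", "email"): [False, False],  # too few answers so far
}
accepted = verify(votes)
```

The machine does the cheap enumeration of all eight pairs; people only answer the yes/no questions.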
Hybrid Systems: Key Issues
• The role of machine (i.e., algorithm) and
humans
– use only humans? both? who’s doing what?
• Quality control
• Payment
• Optimization: What to crowdsource
• Scalability: How much to crowdsource
Entity Linking/Disambiguation
[Figure: entity mentions in a news article linked to the LOD entities http://dbpedia.org/resource/Facebook and http://dbpedia.org/resource/Instagram, with an owl:sameAs link to fbase:Instagram; Google and Android appear as further nodes]
HTML:
<p>Facebook is not waiting for its initial public offering to make its first big purchase.</p>
<p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
RDFa enrichment:
<p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p>
<p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
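The RDFa enrichment shown on this slide can be approximated with a small helper that wraps a paragraph in a `span` carrying the entity URI and marks the mention with `cite`. The `annotate` function is a hypothetical sketch that mirrors the markup on the slide; it is not the ZenCrowd implementation and handles only the first occurrence of one mention per paragraph.

```python
from html import escape


def annotate(paragraph, mention, uri):
    """Return the paragraph as HTML, with the first mention linked to uri."""
    if mention not in paragraph:
        # Nothing to link: emit the plain paragraph.
        return f"<p>{escape(paragraph)}</p>"
    before, after = paragraph.split(mention, 1)
    return (
        f'<p><span about="{escape(uri)}">{escape(before)}'
        f'<cite property="rdfs:label">{escape(mention)}</cite>'
        f"{escape(after)}</span></p>"
    )


text = (
    "Facebook is not waiting for its initial public offering "
    "to make its first big purchase."
)
enriched = annotate(text, "Facebook", "http://dbpedia.org/resource/Facebook")
```

Deciding which URI belongs to which mention is the hard part; that is the linking task the following slides address.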
ZenCrowd
• Combine both algorithmic and manual linking
• Automate manual linking via crowdsourcing
• Dynamically assess human workers with a
probabilistic reasoning framework
[Figure: diagram labeled Crowd, Algorithms, Machines]
ZenCrowd Architecture
[Figure: ZenCrowd architecture. Input: HTML pages. Output: HTML+RDFa pages. Components: Entity Extractors, Algorithmic Matchers, an LOD Index (Get Entity) over the LOD Open Data Cloud, a Probabilistic Network, a Decision Engine, and a Micro-Task Manager that issues Micro Matching Tasks to the Crowdsourcing Platform and receives Workers Decisions]
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic
Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on
World Wide Web (WWW 2012).
Entity Factor Graphs
• Graph components
– Workers, links, clicks
– Prior probabilities
– Link Factors
– Constraints
• Probabilistic inference
– Select all links with posterior prob > τ
[Figure: entity factor graph with 2 workers (w1, w2), 6 clicks (c11, c12, c13, c21, c22, c23), and 3 candidate links (l1, l2, l3). Worker priors pw1( ), pw2( ); link priors pl1( ), pl2( ), pl3( ); link factors lf1( ), lf2( ), lf3( ); SameAs constraint sa1-2( ); dataset unicity constraint u2-3( ). The clicks are the observed variables.]
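To give a feel for "select all links with posterior prob > τ", here is a deliberately simplified sketch. It swaps the factor graph for a naive Bayes update: each worker has one reliability value (the probability that a click is correct), clicks are treated as independent, and the SameAs and unicity constraints are ignored. All names and numbers are assumptions for illustration, with reliabilities strictly between 0 and 1.

```python
def link_posterior(prior, votes, reliability):
    """Posterior probability that a candidate link is correct.

    prior:       prior probability of the link
    votes:       dict worker -> True (link is valid) or False
    reliability: dict worker -> probability that the worker answers correctly
    """
    p_true, p_false = prior, 1.0 - prior
    for worker, said_yes in votes.items():
        r = reliability[worker]
        # Likelihood of this click if the link is correct / incorrect.
        p_true *= r if said_yes else 1.0 - r
        p_false *= 1.0 - r if said_yes else r
    return p_true / (p_true + p_false)


def select_links(priors, clicks, reliability, tau=0.5):
    """Keep every link whose posterior exceeds the threshold tau."""
    return [
        link
        for link in priors
        if link_posterior(priors[link], clicks.get(link, {}), reliability) > tau
    ]


reliability = {"w1": 0.9, "w2": 0.6}
priors = {"l1": 0.5, "l2": 0.5, "l3": 0.2}
clicks = {
    "l1": {"w1": True, "w2": True},
    "l2": {"w1": False, "w2": True},
    "l3": {"w1": True, "w2": False},
}
selected = select_links(priors, clicks, reliability, tau=0.5)
```

In this toy setup the reliable worker w1 outweighs w2 when they disagree, which is the intuition behind weighting workers by their estimated quality.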
Experimental Evaluation
• Datasets
– 25 news articles from
• CNN.com (Global news)
• NYTimes.com (Global news)
• Washington-post.com (US local news)
• Timesofindia.indiatimes.com (India news)
• Swissinfo.com (Switzerland local news)
– 40M entities (Freebase, DBPedia, Geonames, NYT)
Worker Selection
[Figure: two plots. First plot: worker precision (0 to 1) against number of tasks (0 to 500) for US workers and IN workers, with the top US worker highlighted. Second plot: precision (0.6 to 0.8) against top K workers (K = 1 to 9).]
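One simple way to obtain a "top K workers" ranking like the one plotted above is to score each worker against tasks with known answers. The sketch below is an assumption about how such a score could be computed (fraction of correct answers on gold tasks); the function names and data are hypothetical and do not come from the experiments.

```python
def worker_precision(answers, gold):
    """Fraction of correct answers per worker, on tasks that have a gold label.

    answers: dict worker -> dict task -> answer
    gold:    dict task -> correct answer
    """
    precision = {}
    for worker, given in answers.items():
        judged = [task for task in given if task in gold]
        if judged:
            correct = sum(given[task] == gold[task] for task in judged)
            precision[worker] = correct / len(judged)
    return precision


def top_k_workers(precision, k):
    """The k workers with the highest precision (ties broken by name)."""
    return sorted(precision, key=lambda w: (-precision[w], w))[:k]


gold = {"t1": "a", "t2": "b", "t3": "c", "t4": "d"}
answers = {
    "w1": {"t1": "a", "t2": "b", "t3": "c", "t4": "d"},
    "w2": {"t1": "a", "t2": "x", "t3": "x", "t4": "d"},
    "w3": {"t1": "a", "t2": "b", "t3": "c", "t4": "x"},
}
precision = worker_precision(answers, gold)
best_two = top_k_workers(precision, 2)
```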
Lessons Learnt
• Crowdsourcing + probabilistic reasoning works!
• But
– Different worker communities perform differently
– Many low quality workers
– Completion time may vary (based on reward)
• Need to find the right workers for your task
(see WWW13 paper)
ZenCrowd Summary
• ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking
• Standard crowdsourcing improves 6% over automatic approaches
• ZenCrowd improves 4% to 35% over standard crowdsourcing
• ZenCrowd improves 14% on average over automatic approaches
http://exascale.info/zencrowd/
Blocking for Instance Matching
• Find the instances about the same real-world
entity within two datasets
• Avoid comparing all possible pairs
– Step 1: cluster similar items using a cheap
similarity measure
– Step 2: n*n comparison within the clusters with
an expensive measure
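The two blocking steps above can be sketched as follows. The cheap measure here is a three-letter prefix key and the expensive measure is `difflib.SequenceMatcher`; both are stand-ins chosen for this example, as are the function names, the toy records, and the 0.9 threshold.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def block_key(name):
    """Step 1 (cheap): group records by their first three letters."""
    return name.lower().replace(" ", "")[:3]


def expensive_similarity(a, b):
    """Step 2 (expensive): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_with_blocking(left, right, threshold=0.9):
    """Compare records across two datasets, but only within the same block."""
    blocks = defaultdict(lambda: ([], []))
    for record in left:
        blocks[block_key(record)][0].append(record)
    for record in right:
        blocks[block_key(record)][1].append(record)

    matches, compared = [], 0
    for left_block, right_block in blocks.values():
        for a in left_block:
            for b in right_block:
                compared += 1
                if expensive_similarity(a, b) >= threshold:
                    matches.append((a, b))
    return matches, compared


left = ["Barack Obama", "Michelle Obama", "Facebook"]
right = ["Barak Obama", "Facebook Inc", "Instagram"]
matches, compared = match_with_blocking(left, right)
```

With these toy records only 2 of the 9 possible pairs reach the expensive measure, which is the point of blocking.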
Three-stage blocking with the Crowd
for Data Integration
• 1. Cheap clustering/inverted index selection of
candidates
• 2. Expensive similarity measure
• 3. Crowdsource low confidence matches
Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked
Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22,
Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the
Web. October 2013.
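The third stage of the three-stage blocking slide (crowdsource low confidence matches) amounts to routing candidate pairs by their similarity score. A minimal sketch, assuming two hypothetical thresholds `accept_at` and `reject_below` that are not taken from the paper:

```python
def route_candidates(scored_pairs, accept_at=0.9, reject_below=0.5):
    """Split scored candidate pairs into accepted, rejected, and crowd tasks.

    scored_pairs: iterable of (pair, score), score from the expensive measure.
    Pairs the machine is confident about are decided automatically; the
    uncertain middle band is sent to the crowd.
    """
    accepted, rejected, to_crowd = [], [], []
    for pair, score in scored_pairs:
        if score >= accept_at:
            accepted.append(pair)
        elif score < reject_below:
            rejected.append(pair)
        else:
            to_crowd.append(pair)
    return accepted, rejected, to_crowd


# Hypothetical scores for three candidate pairs.
scored = [
    (("Facebook", "Facebook Inc"), 0.95),
    (("Rome", "Roma"), 0.70),
    (("Rome", "Romania"), 0.30),
]
accepted, rejected, to_crowd = route_candidates(scored)
```

Widening the middle band sends more pairs to the crowd and raises cost; narrowing it relies more on the automatic measure.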
Improving Crowdsourcing Platforms
Pull (Traditional) Crowdsourcing
• In MTurk HITs are published on the market
• The first worker willing to do it can take it
• Pro: Fast
• Con: Not necessarily optimal / not the best
worker for the task
Push Crowdsourcing
• Pick-A-Crowd: A system architecture that uses
Task-to-Worker matching:
– The worker’s social profile
– The task context
• Workers can provide higher quality answers
on tasks they relate to
Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me
What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide
Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.
Matching Models: Expert Finding
• Build an inverted index on the pages' titles and descriptions
• Use the title/description of the task as a keyword query on the inverted index to get a subset of pages
• Rank the workers by the number of liked pages in the subset
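The three steps of this matching model can be sketched in a few lines. This is an illustrative approximation only: the tokenizer, the page texts, the worker names, and the unweighted keyword lookup are assumptions for this example and not the Pick-A-Crowd implementation.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages):
    """Inverted index: token -> ids of pages whose title/description has it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def rank_workers(task_text, index, likes):
    """Rank workers by how many of their liked pages match the task text."""
    relevant = set()
    for token in tokenize(task_text):
        relevant |= index.get(token, set())
    scores = {worker: len(liked & relevant) for worker, liked in likes.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pages and worker likes.
pages = {
    "p1": "FC Barcelona football club",
    "p2": "Football fans worldwide",
    "p3": "Italian cooking recipes",
}
likes = {"alice": {"p1", "p2"}, "bob": {"p3"}, "carol": {"p2", "p3"}}

index = build_index(pages)
ranking = rank_workers("Identify the football player in the picture", index, likes)
```

A real system would likely weight tokens and drop stop words; the plain counts here just follow the slide's description.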
Pick-A-Crowd
Discussion
• Pull vs. Push methodologies in Crowdsourcing
• Pick-A-Crowd system architecture with Task-
to-Worker recommendation
• Experimental comparison with AMT shows a
consistent quality improvement
“Workers Know what they Like”
www.openturk.com
OpenTurk
• Yet another platform? Build on top of MTurk!
• Chrome Extension for push / notification
• 400+ users
• http://bit.ly/openturk-extension
• Open source:
https://github.com/openturk/extension
CrowdQ: Crowdsourced Query
Understanding
birthdate of the mayor of the capital city of italy
• capital city of italy
• mayor of rome
• birthdate of ignazio marino
Motivation
• Web Search Engines can answer simple factual
queries directly on the result page
• Users with complex information needs are
often unsatisfied
• Purely automatic techniques are not enough
• We want to solve it with Crowdsourcing!
CrowdQ
• CrowdQ is the first system that uses
crowdsourcing to
– Understand the intended meaning
– Build a structured query template
– Answer the query over Linked Open Data
Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ:
Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems
Research (CIDR 2013).
Hybrid Human-Machine Pipeline
Q = birthdate of actors of forrest gump
• Query annotation: birthdate (noun), actors (noun), forrest gump (named entity)
• Verification: Is forrest gump this entity in the query?
• Entity relations: Which is the relation between actors and forrest gump? (starring)
• Schema element: starring, mapped to dbpedia-owl:starring
• Verification: Is the relation between
– Indiana Jones and Harrison Ford
– Back to the Future and Michael J. Fox
of the same type as the relation between Forrest Gump and actors?
• Structured query generation:
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?y ?x
WHERE { ?y dbpedia-owl:birthdate ?x .
        ?z dbpedia-owl:starring ?y .
        ?z rdfs:label 'Forrest Gump' }
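To show what the generated query computes, the sketch below evaluates the same three triple patterns by hand over a few toy triples. The triples are hand-written for illustration and are not results from BTC09; the function name is hypothetical, and a real deployment would send the query to a SPARQL endpoint instead.

```python
# Toy triples (illustrative only), using the predicate names from the slide.
triples = [
    ("Forrest Gump (film)", "rdfs:label", "Forrest Gump"),
    ("Forrest Gump (film)", "dbpedia-owl:starring", "Tom Hanks"),
    ("Forrest Gump (film)", "dbpedia-owl:starring", "Robin Wright"),
    ("Cast Away", "dbpedia-owl:starring", "Tom Hanks"),
    ("Tom Hanks", "dbpedia-owl:birthdate", "1956-07-09"),
    ("Robin Wright", "dbpedia-owl:birthdate", "1966-04-08"),
]


def birthdates_of_cast(triples, film_label):
    """Join the three patterns: label -> film (?z) -> actor (?y) -> date (?x)."""
    films = {s for s, p, o in triples if p == "rdfs:label" and o == film_label}
    actors = {
        o for s, p, o in triples
        if p == "dbpedia-owl:starring" and s in films
    }
    return sorted(
        (s, o) for s, p, o in triples
        if p == "dbpedia-owl:birthdate" and s in actors
    )


rows = birthdates_of_cast(triples, "Forrest Gump")
```

The shared variables ?z and ?y in the query correspond to the `films` and `actors` sets that link one pattern to the next.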
Results from BTC09:
Q= birthdate of actors of forrest gump
Transactive Search
Transactive Search
• What if the data to answer your query is not
stored on any digital support?
• What if the data is just in people's minds?
• Big Data No Data
Transactive Search
• Search using Transactive (group) Memories
• “Who attended the WWW 2014 conference?”
• Machines: Harvest the Web + Data Mining
• Crowd: Search twitter, look at event pictures
• Transactive Memories: Remember who I met
Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and
Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search.
In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul,
South Korea, April 2014.
Transactive Search (2)
Transactive Search (3)
Discussion
• Sometimes data is not on the Web
• The right group of people can still answer
– Collaboratively
– Using Transactive Search
– Better than machines or anonymous crowds
• Open challenges
– Incentives
– Repeatability
– Social network analysis (SNA)
Research Directions for
Micro-task Crowdsourcing
State of Micro-task Crowdsourcing
• Platform side
– Pull platforms
– Batch processing
• Worker side
– Work flexibility
– Anonymity
• Requester side
– Web/API
The Future for Requesters
• Push Platforms
– RecSys, User Modeling, Trust
• Mobile Access
• Quality and Time guarantees
• Worker API (enable novel worker UI)
The Future of the Worker side
• Reputation system for workers
• More than financial incentives
• Recognize worker potential (badges)
– Paid for their expertise
• Train less skilled workers (tutoring system)
Aniket Kittur et al. The Future of Crowd Work. CSCW 2013.
Crowdsourcing Ethics
• People work full-time as crowd workers
• Chinese crowdsourcing platform with 5.5M workers
• Pros
– Help developing countries
– Provide cash fast to people == short-term satisfaction
– Job Flexibility
• Cons
– No job security
– No social security
– Long term satisfaction? Career plans?
Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”,
September 2013.
Conclusions
• Structured Data makes the Web better
• It’s growing fast
– Large volume
– Large heterogeneity
• Crowds can help understand data semantics
• Hybrid human-machine systems (ZenCrowd)
• Research opportunities:
– Exploit Human Intelligence at Scale (CrowdQ)
– Pick the right crowd (Pick-A-Crowd, Transactive Search)
gianlucademartini.net
demartini@exascale.info

More Related Content

What's hot

GA Project #4 Student Presentation - Platfora
GA Project #4 Student Presentation - PlatforaGA Project #4 Student Presentation - Platfora
GA Project #4 Student Presentation - Platfora
RGK Consulting and Photography
 
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
Cataldo Musto
 
Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5
Jeremie Dauphin
 
Tutorial Cognition - Irene
Tutorial Cognition - IreneTutorial Cognition - Irene
Tutorial Cognition - Irene
SSSW
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
University of Washington
 
Clarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal PresentationClarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal Presentation
Joshua S. White, PhD josh@securemind.org
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
University of Washington
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
University of Washington
 
Data dynamite presentation
Data dynamite presentationData dynamite presentation
Data dynamite presentation
W. David Stephenson
 
Open Government Data, Linked Data, and the Missing Blocks in Korea
Open Government Data, Linked Data, and the Missing Blocks in Korea Open Government Data, Linked Data, and the Missing Blocks in Korea
Open Government Data, Linked Data, and the Missing Blocks in Korea
Haklae Kim
 

What's hot (10)

GA Project #4 Student Presentation - Platfora
GA Project #4 Student Presentation - PlatforaGA Project #4 Student Presentation - Platfora
GA Project #4 Student Presentation - Platfora
 
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
DeCAT 2015 - International Workshop on Deep Content Analytics Techniques for ...
 
Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5Elements of AI Luxembourg - session 5
Elements of AI Luxembourg - session 5
 
Tutorial Cognition - Irene
Tutorial Cognition - IreneTutorial Cognition - Irene
Tutorial Cognition - Irene
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Clarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal PresentationClarkson - Joshua White - Research Proposal Presentation
Clarkson - Joshua White - Research Proposal Presentation
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Data dynamite presentation
Data dynamite presentationData dynamite presentation
Data dynamite presentation
 
Open Government Data, Linked Data, and the Missing Blocks in Korea
Open Government Data, Linked Data, and the Missing Blocks in Korea Open Government Data, Linked Data, and the Missing Blocks in Korea
Open Government Data, Linked Data, and the Missing Blocks in Korea
 

Similar to Human Computation for Big Data

Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
eXascale Infolab
 
Not Your Mom's SEO
Not Your Mom's SEONot Your Mom's SEO
Not Your Mom's SEO
Marianne Sweeny
 
Workshop on Crowd-sourcing
Workshop on Crowd-sourcingWorkshop on Crowd-sourcing
Workshop on Crowd-sourcing
eXascale Infolab
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
Mark Grover
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
markgrover
 
ISWC2013参加記
ISWC2013参加記ISWC2013参加記
Crowdsourcing: A Survey
Crowdsourcing: A SurveyCrowdsourcing: A Survey
Crowdsourcing: A Survey
IJERA Editor
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
Marieke van Erp
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
Philippe Mizrahi
 
Emerging Web Technologies October 2013
Emerging Web Technologies October 2013Emerging Web Technologies October 2013
Emerging Web Technologies October 2013
bthat
 
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
eswcsummerschool
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
Tao Feng
 
Entities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web SearchEntities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web Search
eXascale Infolab
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
EUCLID project
 
Smashing SIlos: UX is the New SEO
Smashing SIlos: UX is the New SEOSmashing SIlos: UX is the New SEO
Smashing SIlos: UX is the New SEO
BrightEdge
 
Seminar
SeminarSeminar
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
Hakka Labs
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and Visualisation
Marieke van Erp
 
Visual and interactive storytelling slides cmg 2015-final
Visual and interactive storytelling slides    cmg 2015-finalVisual and interactive storytelling slides    cmg 2015-final
Visual and interactive storytelling slides cmg 2015-final
Katherine-CWACanada
 
talk for HK SME center about web3.0 , AI, mobile apps
talk for HK SME center about web3.0 , AI, mobile appstalk for HK SME center about web3.0 , AI, mobile apps
talk for HK SME center about web3.0 , AI, mobile apps
Alex Hung
 

Similar to Human Computation for Big Data (20)

Entity-Centric Data Management
Entity-Centric Data ManagementEntity-Centric Data Management
Entity-Centric Data Management
 
Not Your Mom's SEO
Not Your Mom's SEONot Your Mom's SEO
Not Your Mom's SEO
 
Workshop on Crowd-sourcing
Workshop on Crowd-sourcingWorkshop on Crowd-sourcing
Workshop on Crowd-sourcing
 
Democratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data DiscoveryDemocratizing Data within your organization - Data Discovery
Democratizing Data within your organization - Data Discovery
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
ISWC2013参加記
ISWC2013参加記ISWC2013参加記
ISWC2013参加記
 
Crowdsourcing: A Survey
Crowdsourcing: A SurveyCrowdsourcing: A Survey
Crowdsourcing: A Survey
 
Lecture4 Social Web
Lecture4 Social Web Lecture4 Social Web
Lecture4 Social Web
 
Meetup SF - Amundsen
Meetup SF  -  AmundsenMeetup SF  -  Amundsen
Meetup SF - Amundsen
 
Emerging Web Technologies October 2013
Emerging Web Technologies October 2013Emerging Web Technologies October 2013
Emerging Web Technologies October 2013
 
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
 
Data council sf amundsen presentation
Data council sf    amundsen presentationData council sf    amundsen presentation
Data council sf amundsen presentation
 
Entities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web SearchEntities, Graphs, and Crowdsourcing for better Web Search
Entities, Graphs, and Crowdsourcing for better Web Search
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Smashing SIlos: UX is the New SEO
Smashing SIlos: UX is the New SEOSmashing SIlos: UX is the New SEO
Smashing SIlos: UX is the New SEO
 
Seminar
SeminarSeminar
Seminar
 
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedInDataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
DataEngConf: Building Satori, a Hadoop toll for Data Extraction at LinkedIn
 
Lecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and VisualisationLecture 5: Mining, Analysis and Visualisation
Lecture 5: Mining, Analysis and Visualisation
 
Visual and interactive storytelling slides cmg 2015-final
Visual and interactive storytelling slides    cmg 2015-finalVisual and interactive storytelling slides    cmg 2015-final
Visual and interactive storytelling slides cmg 2015-final
 
talk for HK SME center about web3.0 , AI, mobile apps
talk for HK SME center about web3.0 , AI, mobile appstalk for HK SME center about web3.0 , AI, mobile apps
talk for HK SME center about web3.0 , AI, mobile apps
 

More from eXascale Infolab

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
eXascale Infolab
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
eXascale Infolab
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
eXascale Infolab
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
eXascale Infolab
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
eXascale Infolab
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
eXascale Infolab
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...
eXascale Infolab
 
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data OceansDependency-Driven Analytics: A Compass for Uncharted Data Oceans
Dependency-Driven Analytics: A Compass for Uncharted Data Oceans
eXascale Infolab
 
Crowd scheduling www2016
Crowd scheduling www2016Crowd scheduling www2016
Crowd scheduling www2016
eXascale Infolab
 
SANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference ResolutionSANAPHOR: Ontology-based Coreference Resolution
SANAPHOR: Ontology-based Coreference Resolution
eXascale Infolab
 
Efficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked DataEfficient, Scalable, and Provenance-Aware Management of Linked Data
Efficient, Scalable, and Provenance-Aware Management of Linked Data
eXascale Infolab
 
SSSW 2015 Sense Making
SSSW 2015 Sense MakingSSSW 2015 Sense Making
SSSW 2015 Sense Making
eXascale Infolab
 
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked DataLDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
LDOW2015 - Uduvudu: a Graph-Aware and Adaptive UI Engine for Linked Data
eXascale Infolab
 
Executing Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web DataExecuting Provenance-Enabled Queries over Web Data
Executing Provenance-Enabled Queries over Web Data
eXascale Infolab
 
The Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task CrowdsourcingThe Dynamics of Micro-Task Crowdsourcing
The Dynamics of Micro-Task Crowdsourcing
eXascale Infolab
 
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
Fixing the Domain and Range of Properties in Linked Data by Context Disambigu...
eXascale Infolab
 
CIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition rankingCIKM14: Fixing grammatical errors by preposition ranking
CIKM14: Fixing grammatical errors by preposition ranking
eXascale Infolab
 
OLTP-Bench
OLTP-BenchOLTP-Bench
OLTP-Bench
eXascale Infolab
 
An Introduction to Big Data
An Introduction to Big DataAn Introduction to Big Data
An Introduction to Big Data
eXascale Infolab
 
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
Internet Infrastructures for Big Data (Verisign's Distinguished Speaker Series)
eXascale Infolab
 

More from eXascale Infolab (20)

Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link PredictionBeyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
Beyond Triplets: Hyper-Relational Knowledge Graph Embedding for Link Prediction
 
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
It Takes Two: Instrumenting the Interaction between In-Memory Databases and S...
 
Representation Learning on Complex Graphs
Representation Learning on Complex GraphsRepresentation Learning on Complex Graphs
Representation Learning on Complex Graphs
 
A force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory mapA force directed approach for offline gps trajectory map
A force directed approach for offline gps trajectory map
 
Cikm 2018
Cikm 2018Cikm 2018
Cikm 2018
 
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms wit...
 
SwissLink: High-Precision, Context-Free Entity Linking Exploiting Unambiguous...



Human Computation for Big Data

  • 1. Human Computation for Big Data Gianluca Demartini eXascale Infolab University of Fribourg, Switzerland gianlucademartini.net exascale.info CUSO Seminar on Big Data – May 23, 2014 – Fribourg
  • 2. Gianluca Demartini • M.Sc. at University of Udine, Italy • Ph.D. at University of Hannover, Germany – Entity Retrieval • Worked for UC Berkeley (on Crowdsourcing), Yahoo! Research (Spain), L3S Research Center (Germany) • Post-doc at the eXascale Infolab, Uni Fribourg, Switzerland • Lecturer for Social Computing in Fribourg • Tutorial on Entity Search at ECIR 2012, on Crowdsourcing at ESWC 2013 and ISWC 2013 • Research Interests – Information Retrieval, Semantic Web, Human Computation • demartini@exascale.info
  • 3. Web of Data • Freebase – Acquired by Google in July 2010. – Knowledge Graph launched in May 2012. • Schema.org – Driven by major search engine companies – Machine-readable annotations of Web pages • Linked Open Data – 31 billion triples, Sept. 2011 • Volume and Variety
  • 4. Linked Open Data (figure; source: Z. Kaoudi and I. Manolescu, ICDE seminar 2013)
  • 5. LOD is an enormous graph • Triples: Subject – Predicate – Object – e.g., (Barack Obama, marriedTo, Michelle Obama) • Specific scalable DB systems exist (figure: example graph with entity nodes e1, e2, e3, e4 and predicate edges p1, p2, p3)
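The subject, predicate, object model on slide 5 can be illustrated with a minimal in-memory sketch. Only the first triple comes from the slide; the other triples and the `match` helper are assumptions added for illustration, and this is not how a scalable triple store is implemented.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# Only the first triple appears on the slide; the others are illustrative.
TRIPLES = [
    ("Barack Obama", "marriedTo", "Michelle Obama"),
    ("Barack Obama", "type", "Person"),
    ("Michelle Obama", "type", "Person"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # Who is Barack Obama married to?
    print(match(TRIPLES, s="Barack Obama", p="marriedTo"))
```

Pattern matching with wildcards is the basic operation that triple-pattern query languages such as SPARQL build on.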
  • 6. I will talk about • Micro-task Crowdsourcing • Hybrid Human-Machine systems • Entity Linking/Disambiguation – On the Web using crowdsourcing • Improving Crowdsourcing Platform Quality – Pushing tasks to workers • Research directions – Crowdsourced Query Understanding – Transactive Search
  • 7. Crowdsourcing • Exploit human intelligence to solve – Tasks simple for humans, complex for machines – With a large number of humans (the Crowd) – Small problems: micro-tasks (Amazon MTurk) • Examples – Wikipedia, Image tagging • Incentives – Financial, fun, visibility
  • 8. Case-Study: Amazon MTurk • Micro-task crowdsourcing marketplace • On-demand, scalable, real-time workforce • Varied worker motivations (not just money) • Online since 2005 (still in “beta”) • Currently the most popular platform • Developer’s API as well as GUI
  • 10. A Task on MTurk
  • 11. Amazon MTurk Workflow • Requesters create tasks (HITs) • Workers preview, accept, submit HITs • Requesters approve, download results
  • 12. Example: Hybrid Image Search • Yan, Kumar, Ganesan, CrowdSearch: Exploiting Crowds for Accurate Real-time Image Search on Mobile Phones, MobiSys 2010.
  • 13. Example: Hybrid Data Integration
    Source schema:
      paper             | conf
      Data integration  | VLDB-01
      Data mining       | SIGMOD-02
    Target schema:
      title        | author | email  | venue
      OLAP         | Mike   | mike@a | ICDE-02
      Social media | Jane   | jane@b | PODS-05
    • Generate plausible matches – paper = title, paper = author, paper = email, paper = venue – conf = title, conf = author, conf = email, conf = venue
    • Ask users to verify – “Does attribute paper match attribute author?” (Yes / No / Not sure)
    McCann, Shen, Doan: Matching Schemas in Online Communities. ICDE, 2008
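The generate-then-verify idea on slide 13 can be sketched as follows. This is a minimal illustration, not the method from the cited paper: the function names and the majority threshold are assumptions, and "Not sure" answers are simply ignored.

```python
from itertools import product

# Attribute names taken from the slide's two schemas.
SOURCE_ATTRS = ["paper", "conf"]
TARGET_ATTRS = ["title", "author", "email", "venue"]


def plausible_matches(source, target):
    """Enumerate every source/target attribute pair as a candidate match."""
    return list(product(source, target))


def verification_question(src, tgt):
    """Phrase one candidate match as a yes/no question for users."""
    return f"Does attribute {src} match attribute {tgt}?"


def accepted(answers, min_yes_ratio=0.5):
    """Accept a match if 'yes' strictly exceeds the ratio among decisive votes.

    The threshold and the handling of 'not sure' are assumptions.
    """
    decisive = [a for a in answers if a in ("yes", "no")]
    if not decisive:
        return False
    return decisive.count("yes") / len(decisive) > min_yes_ratio


if __name__ == "__main__":
    for src, tgt in plausible_matches(SOURCE_ATTRS, TARGET_ATTRS):
        print(verification_question(src, tgt))
```

With two source attributes and four target attributes this yields the eight candidate matches listed on the slide.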
  • 14. Hybrid Systems: Key Issues • The roles of machines (i.e., algorithms) and humans – use only humans? both? who’s doing what? • Quality control • Payment • Optimization: What to crowdsource • Scalability: How much to crowdsource
  • 16. RDFa enrichment of HTML • Linked entities: http://dbpedia.org/resource/Facebook; http://dbpedia.org/resource/Instagram (owl:sameAs fbase:Instagram) • Other labels on the slide: Google, Android
    HTML:
      <p>Facebook is not waiting for its initial public offering to make its first big purchase.</p>
      <p>In its largest acquisition to date, the social network has purchased Instagram, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</p>
    HTML + RDFa:
      <p><span about="http://dbpedia.org/resource/Facebook"><cite property="rdfs:label">Facebook</cite> is not waiting for its initial public offering to make its first big purchase.</span></p>
      <p><span about="http://dbpedia.org/resource/Instagram">In its largest acquisition to date, the social network has purchased <cite property="rdfs:label">Instagram</cite>, the popular photo-sharing application, for about $1 billion in cash and stock, the company said Monday.</span></p>
  • 17. ZenCrowd • Combine both algorithmic and manual linking • Automate manual linking via crowdsourcing • Dynamically assess human workers with a probabilistic reasoning framework (figure labels: Crowd, Algorithms, Machines)
  • 18. ZenCrowd Architecture (figure components: Input: HTML Pages; Entity Extractors; LOD Index (Get Entity) over the LOD Open Data Cloud; Algorithmic Matchers; Micro-Task Manager; Micro Matching Tasks; Crowdsourcing Platform; Workers Decisions; Probabilistic Network; Decision Engine; Output: HTML+RDFa Pages) • Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. ZenCrowd: Leveraging Probabilistic Reasoning and Crowdsourcing Techniques for Large-Scale Entity Linking. In: 21st International Conference on World Wide Web (WWW 2012).
  • 19. Entity Factor Graphs • Graph components – Workers, links, clicks – Prior probabilities – Link Factors – Constraints • Probabilistic Inference – Select all links with posterior prob > τ (figure: factor graph with 2 workers w1, w2; 6 clicks c11, c12, c13, c21, c22, c23; 3 candidate links l1, l2, l3; worker priors pw1(), pw2(); link priors pl1(), pl2(), pl3(); link factors lf1(), lf2(), lf3(); SameAs constraint sa1-2(); dataset unicity constraint u2-3(); legend: link priors, worker priors, observed variables, link factors, SameAs constraints, dataset unicity constraints)
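To make the "select all links with posterior prob > τ" step concrete, here is a deliberately simplified sketch. It is not the ZenCrowd factor graph: it assumes independent workers with a known reliability (the probability that a worker's click is correct), uses a naive Bayes update per link, and ignores the SameAs and unicity constraints. All names and numbers are illustrative assumptions.

```python
def link_posterior(prior, votes):
    """Posterior probability that a candidate link is correct.

    votes: list of (worker_reliability, said_valid) pairs.
    Assumes independent workers (a simplification of the factor graph).
    """
    p_true, p_false = prior, 1.0 - prior
    for reliability, said_valid in votes:
        if said_valid:
            p_true *= reliability
            p_false *= 1.0 - reliability
        else:
            p_true *= 1.0 - reliability
            p_false *= reliability
    total = p_true + p_false
    # Fall back to the prior if the evidence is contradictory and degenerate.
    return p_true / total if total > 0 else prior


def select_links(candidates, tau=0.5):
    """Keep the links whose posterior exceeds the threshold tau."""
    return [
        link for link, prior, votes in candidates
        if link_posterior(prior, votes) > tau
    ]


if __name__ == "__main__":
    candidates = [
        ("l1", 0.5, [(0.9, True), (0.6, True)]),
        ("l2", 0.5, [(0.9, False), (0.6, True)]),
    ]
    print(select_links(candidates, tau=0.5))
```

In this toy setting a reliable worker's click outweighs a less reliable worker's click, which is the intuition behind assessing workers dynamically.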
  • 20. Experimental Evaluation • Datasets – 25 news articles from • CNN.com (Global news) • NYTimes.com (Global news) • Washington-post.com (US local news) • Timesofindia.indiatimes.com (India news) • Swissinfo.com (Switzerland local news) – 40M entities (Freebase, DBpedia, Geonames, NYT)
  • 21. Worker Selection (figures: worker precision vs. number of tasks for US workers and IN workers, with the top US worker marked; precision vs. top K workers)
  • 22. Lessons Learnt • Crowdsourcing + probabilistic reasoning works! • But – Different worker communities perform differently – Many low quality workers – Completion time may vary (based on reward) • Need to find the right workers for your task (see WWW13 paper)
  • 23. ZenCrowd Summary • ZenCrowd: Probabilistic reasoning over automatic and crowdsourcing methods for entity linking • Standard crowdsourcing improves 6% over automatic • 4%–35% improvement over standard crowdsourcing • 14% average improvement over automatic approaches • http://exascale.info/zencrowd/
  • 24. Blocking for Instance Matching • Find the instances about the same real-world entity within two datasets • Avoid comparing all possible pairs – Step 1: cluster similar items using a cheap similarity measure – Step 2: n×n comparisons within the clusters with an expensive measure
  • 25. Three-stage blocking with the Crowd for Data Integration • 1. Cheap clustering/inverted index selection of candidates • 2. Expensive similarity measure • 3. Crowdsource low confidence matches • Gianluca Demartini, Djellel Eddine Difallah, and Philippe Cudré-Mauroux. Large-Scale Linked Data Integration Using Probabilistic Reasoning and Crowdsourcing. In: VLDB Journal, Volume 22, Issue 5 (2013), Page 665-687, Special issue on Structured, Social and Crowd-sourced Data on the Web. October 2013.
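The three stages on slides 24 and 25 can be sketched roughly as below. This is an illustration under stated assumptions rather than the pipeline from the cited paper: records from both datasets are pooled into one list, the cheap blocking key is the first three letters of the name, the expensive measure is Python's `difflib.SequenceMatcher`, and the two thresholds are arbitrary.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block_key(name):
    """Stage 1 (cheap): a crude blocking key, assumed here for illustration."""
    return name.lower()[:3]


def similarity(a, b):
    """Stage 2 (expensive): character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def three_stage(items, accept=0.9, reject=0.5):
    """Return (automatic matches, pairs to send to the crowd)."""
    blocks = defaultdict(list)
    for item in items:
        blocks[block_key(item)].append(item)

    matches, to_crowd = [], []
    for block in blocks.values():
        # Only pairs inside the same block are compared.
        for a, b in combinations(block, 2):
            score = similarity(a, b)
            if score >= accept:
                matches.append((a, b))
            elif score >= reject:
                # Stage 3: low-confidence pairs go to the crowd.
                to_crowd.append((a, b))
    return matches, to_crowd


if __name__ == "__main__":
    names = ["Barack Obama", "barack obama", "Barack H. Obama", "Michelle Obama"]
    auto, crowd = three_stage(names)
    print("automatic:", auto)
    print("crowd:", crowd)
```

Blocking keeps the number of expensive comparisons proportional to the block sizes instead of the full n×n space, and the crowd only sees the pairs the machine is unsure about.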
  • 27. Pull (Traditional) Crowdsourcing • In MTurk, HITs are published on the market • The first worker willing to do a HIT can take it • Pro: Fast • Con: Not necessarily optimal / not the best worker for the task
  • 28. Push Crowdsourcing • Pick-A-Crowd: A system architecture that uses Task-to-Worker matching: – The worker’s social profile – The task context • Workers can provide higher quality answers on tasks they relate to • Djellel Eddine Difallah, Gianluca Demartini, and Philippe Cudré-Mauroux. Pick-A-Crowd: Tell Me What You Like, and I'll Tell You What to Do. In: 22nd International Conference on World Wide Web (WWW 2013), Rio de Janeiro, Brazil, May 2013.
  • 29. Matching Models – Expert Finding • Build an inverted index on the pages’ titles and descriptions • Use the title/description of the task as a keyword query on the inverted index and get a subset of pages • Rank the workers by the number of liked pages in the subset
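The expert-finding model on slide 29 can be sketched in a few lines. This is a toy version under assumptions: whitespace tokenization, any-token (OR) retrieval, and made-up page texts, workers, and likes. It is not the Pick-A-Crowd implementation.

```python
from collections import defaultdict


def build_index(pages):
    """Inverted index: token -> set of page ids whose text contains it.

    pages: {page_id: title and description text}
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in text.lower().split():
            index[token].add(page_id)
    return index


def rank_workers(task_text, index, likes):
    """Rank workers by how many pages retrieved for the task they like.

    likes: {worker: set of liked page ids}
    """
    subset = set()
    for token in task_text.lower().split():
        subset |= index.get(token, set())
    scores = {worker: len(liked & subset) for worker, liked in likes.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    pages = {
        "p1": "Football club news",
        "p2": "Italian cooking recipes",
        "p3": "Football world cup",
    }
    likes = {"alice": {"p1", "p3"}, "bob": {"p2"}, "carol": {"p1", "p2"}}
    index = build_index(pages)
    print(rank_workers("identify football players", index, likes))
```

A production system would likely use a proper retrieval engine with ranking instead of a plain token match, but the worker score is still a count over the retrieved subset.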
  • 31. Discussion • Pull vs. Push methodologies in Crowdsourcing • Pick-A-Crowd system architecture with Task-to-Worker recommendation • Experimental comparison with AMT shows a consistent quality improvement: “Workers Know what they Like” • www.openturk.com
  • 32. OpenTurk • Yet another platform? Build on top of MTurk! • Chrome Extension for push / notification • 400+ users • http://bit.ly/openturk-extension • Open source: https://github.com/openturk/extension
  • 34. birthdate of the mayor of the capital city of italy
  • 35. capital city of italy
  • 36. mayor of rome
  • 37. birthdate of ignazio marino
  • 38. Motivation • Web Search Engines can answer simple factual queries directly on the result page • Users with complex information needs are often unsatisfied • Purely automatic techniques are not enough • We want to solve it with Crowdsourcing!
  • 39. CrowdQ • CrowdQ is the first system that uses crowdsourcing to – Understand the intended meaning – Build a structured query template – Answer the query over Linked Open Data • Gianluca Demartini, Beth Trushkowsky, Tim Kraska, and Michael Franklin. CrowdQ: Crowdsourced Query Understanding. In: 6th Biennial Conference on Innovative Data Systems Research (CIDR 2013).
  • 40. Hybrid Human-Machine Pipeline • Q = birthdate of actors of forrest gump • Query annotation: Noun, Noun, Named entity • Verification: “Is forrest gump this entity in the query?” • Entity relations: “Which is the relation between: actors and forrest gump?” (starring) • Schema element: Starring, <dbpedia-owl:starring> • Verification: “Is the relation between Indiana Jones – Harrison Ford and Back to the Future – Michael J. Fox of the same type as Forrest Gump – actors?”
  • 41. Structured query generation • Q = birthdate of actors of forrest gump • SELECT ?y ?x WHERE { ?y <dbpedia-owl:birthdate> ?x . ?z <dbpedia-owl:starring> ?y . ?z <rdfs:label> 'Forrest Gump' } • Results from BTC09 (figure)
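One way to read slides 40 and 41 is that the crowd-verified pieces (the target property, the relation, and the entity label) are slotted into a fixed query template. The sketch below shows that idea only; `build_query` is a hypothetical helper, not part of CrowdQ, and it reproduces the slide's shorthand of writing prefixed names inside angle brackets rather than declaring prefixes.

```python
def build_query(target_property, relation_property, entity_label):
    """Fill the slide's query template with verified schema elements.

    Hypothetical helper: the template mirrors the query shown on the slide.
    """
    # Escape backslashes and single quotes so the label stays a valid literal.
    label = entity_label.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "SELECT ?y ?x WHERE { "
        f"?y <{target_property}> ?x . "
        f"?z <{relation_property}> ?y . "
        f"?z <rdfs:label> '{label}' "
        "}"
    )


if __name__ == "__main__":
    print(build_query("dbpedia-owl:birthdate", "dbpedia-owl:starring", "Forrest Gump"))
```

Keeping the template separate from the slot values means the same structure can be reused for queries of the same shape, such as the Indiana Jones and Back to the Future examples used for verification.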
  • 43. Transactive Search • What if the data to answer your query is not stored on any digital medium? • What if the data is just in people’s minds? • Big Data No Data
  • 44. Transactive Search • Search using Transactive (group) Memories • “Who attended the WWW 2014 conference?” • Machines: Harvest the Web + Data Mining • Crowd: Search Twitter, look at event pictures • Transactive Memories: Remember who I met • Michele Catasta, Alberto Tonon, Djellel Eddine Difallah, Gianluca Demartini, Karl Aberer, and Philippe Cudré-Mauroux. Hippocampus: Answering Memory Queries using Transactive Search. In: 23rd International Conference on World Wide Web (WWW 2014), Web Science Track. Seoul, South Korea, April 2014.
  • 47. Discussion • Sometimes data is not on the Web • The right group of people can still answer – Collaboratively – Using Transactive Search – Better than machines or anonymous crowds • Open challenges – Incentives – Repeatability – SNA
  • 48. Research Directions for Micro-task Crowdsourcing
  • 49. State of Micro-task Crowdsourcing • Platform side – Pull platforms – Batch processing • Worker side – Work flexibility – Anonymity • Requester side – Web/API
  • 50. The Future for Requesters • Push Platforms – RecSys, User Modeling, Trust • Mobile Access • Quality and Time guarantees • Worker API (enable novel worker UI)
  • 51. The Future of the Worker side • Reputation system for workers • More than financial incentives • Recognize worker potential (badges) – Paid for their expertise • Train less skilled workers (tutoring system) • Aniket Kittur et al. The Future of Crowd Work. CSCW 2013.
  • 52. Crowdsourcing Ethics • People work full-time as crowd workers • Chinese crowdsourcing platform with 5.5M workers • Pros – Help developing countries – Fast cash for people (short-term satisfaction) – Job flexibility • Cons – No job security – No social security – Long-term satisfaction? Career plans? • Dagstuhl Seminar on “Crowdsourcing: From Theory to Practice and Long-Term Perspectives”, September 2013.
  • 53. Conclusions • Structured Data makes the Web better • It’s growing fast – Large volume – Large heterogeneity • Crowds can help understand data semantics • Hybrid human-machine systems (ZenCrowd) • Research opportunities: – Exploit Human Intelligence at Scale (CrowdQ) – Pick the right crowd (Pick-A-Crowd, Transactive Search) • gianlucademartini.net • demartini@exascale.info

Editor's Notes

  1. 5min
  2. embarrassingly parallelizable
  3. 35min