Automated evaluation of crowdsourced annotations in the cultural heritage domain

+
Automated Evaluation of Crowdsourced
Annotations in the Cultural Heritage Domain
Archana Nottamkandath, Jasper Oosterman, Davide Ceolin and Wan Fokkink
VU University Amsterdam and TU Delft, The Netherlands
1

+
Overview
• Project Overview
• Use case
• Research Questions
• Experiment
• Results
• Conclusion
2

+
Context
• COMMIT Project
– ICT project in the Netherlands
– Subprojects: SEALINCMedia and Data2Semantics
• Socially Enriched Access to Linked Cultural Media
(SEALINCMedia)
– Collaboration with cultural heritage institutions to enrich their
collections and make them more accessible
3

+
Use case
• Cultural heritage (CH) institutions have large collections that are poorly annotated (Rijksmuseum Amsterdam: over 1 million items)
• Lack of sufficient resources: knowledge, cost, labor
• Solution: crowdsourcing
4

+
Crowdsourcing Annotation Tasks
[Diagram: an annotator from the crowd provides annotations (e.g. "Roses", "Garden", "Car") for an artefact (painting or object); the annotations then go through evaluation.]
5

+
Annotation evaluation
• Manual evaluation is not feasible
– Institutions have large collections (Rijksmuseum: over 1 million items)
– The crowd provides a large number of annotations
– Evaluation costs time and money
– Museums have limited resources
6

+
Need for automated algorithms
• Thus there is a need to develop algorithms that automatically evaluate annotations with good accuracy
7

+
Previous approach
• Building user profiles and tracking user reputation based on semantic similarity
• Tracking provenance information for users
• Realization: a lot of data is provided, and meaningful information can be derived from it
• Current approach: can we determine the quality of information based on features?
8

+
Research questions
• Can we evaluate annotations based on properties of the annotator and the annotation?
• Can we predict the reputation of an annotator based on annotator properties?
[Example: the annotation "Roses" (no typo, noun, in WordNet) provided by an annotator (age 25, male, arts degree)]
9

+
Relevant features
• Features of annotation
– Annotator
– Quality score
– Length
– Specificity…
• Features of annotator
– Age
– Gender
– Education
– Tagging experience…
10

+
Semantic Representation
• Open Annotation model to represent annotations
• FOAF to represent annotator properties
[Diagram: nodes Annotation, Target, Tag, User, Reviewer, Review and Review value, connected by oac:hasBody, oac:hasTarget, oac:annotator, oac:annotates and rdf:type (oac:annotation, foaf:person); properties such as foaf:age, foaf:gender and ex:length are marked "Used to estimate".]
11
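A minimal sketch of how one annotation could be written down as triples, using only the property names that appear on the slide. The `ex:` identifiers, the example values and the direction of each edge are assumptions made for illustration; they are one reading of the diagram, not the authors' actual data.

```python
# Hypothetical triples for a single annotation; property names come from the
# slide, all ex: identifiers and literal values are invented for illustration.
triples = [
    ("ex:annotation1", "rdf:type", "oac:annotation"),
    ("ex:annotation1", "oac:hasBody", "ex:tag_roses"),
    ("ex:annotation1", "oac:hasTarget", "ex:painting1"),
    ("ex:annotation1", "oac:annotator", "ex:user1"),
    ("ex:tag_roses", "ex:length", 5),
    ("ex:user1", "rdf:type", "foaf:person"),
    ("ex:user1", "foaf:age", 25),
    ("ex:user1", "foaf:gender", "male"),
]


def properties_of(subject, triples):
    """Collect the predicate/object pairs stated about one subject."""
    return {p: o for s, p, o in triples if s == subject}


annotator_features = properties_of("ex:user1", triples)
print(annotator_features)
```

Looking up all properties of a subject in this way is how annotator and annotation features could be gathered before learning.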

+
Experiment
Steve.museum dataset
• We performed our evaluations on the Steve.museum dataset
• Online dataset of images and annotations
Statistic                           Value
Provided tags                       45,733
Unique tags                         13,949
Tags evaluated as useful            39,931 (87%)
Tags evaluated as not-useful        5,802 (13%)
Number of annotators / registered   1218 / 488 (40%)
12

+
Steve.museum annotation evaluation
• The annotations in the Steve.museum project were evaluated into multiple categories; we classified each evaluation as either useful or not-useful
[Evaluation categories shown: Usefulness-useful, Judgement-positive, Judgement-negative, Problematic-foreign, Problematic-typo, …, Usefulness-not useful]
13

+
Identify relevant annotation properties
• Manually select properties (F_man)
– is_adjective, is_english, in_wordnet
• List of all possible properties (F_all)
– F_man + [created_day/hour, length, specificity, nrwords, frequency]
• Apply a feature selection algorithm on F_all to choose properties (F_ml)
– Feature selection algorithm from the WEKA toolkit
– WEKA is a collection of machine learning algorithms for data mining tasks
– http://www.weka.net.nz/
14
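A minimal sketch of how a few of the listed annotation properties could be computed for one tag. The function name, the toy vocabulary standing in for a WordNet lookup, and the reading of created_day as day of the week are assumptions for illustration; the slides do not define these features precisely.

```python
from collections import Counter
from datetime import datetime


def annotation_features(tag, created, tag_counts, vocabulary):
    """Return a feature dict for one tag (all names are illustrative)."""
    text = tag.strip().lower()
    return {
        "length": len(text),                # number of characters
        "nrwords": len(text.split()),       # number of words
        "frequency": tag_counts[text],      # how often the tag occurs overall
        "created_day": created.weekday(),   # assumption: day of week, 0 = Monday
        "created_hour": created.hour,
        "in_vocabulary": text in vocabulary,  # toy stand-in for in_wordnet
    }


tags = ["Roses", "garden", "roses", "old red car"]
tag_counts = Counter(t.strip().lower() for t in tags)
vocabulary = {"roses", "garden", "car"}  # hypothetical word list

features = annotation_features(
    "Roses", datetime(2013, 5, 6, 14, 30), tag_counts, vocabulary
)
print(features)
```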

+
Build train and test data
• Split the Steve.museum annotations into a train set and a test set
– Train set: features and goal (quality)
– Test set: only the features
• Balanced classes: the train set had 1000 useful and 1000 not-useful annotations
Train data
Tag     Feature 1   Feature 2   Feature n   Quality
Rose    f1          f2          fn          Useful
House   f11         f12         f1n         Not-useful

Test data
Tag     Feature 1   Feature 2   Feature n
Lily    f1          f2          fn
Sky     f11         f12         f1n
15
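The balanced train set described above can be sketched as follows. This is an illustration on toy data: the function name and record layout are hypothetical, and using all remaining annotations as the test set is an assumption, since the slides do not say how the test set was formed.

```python
import random


def balanced_split(annotations, per_class, seed=0):
    """Sample `per_class` items of each quality label for training.

    Everything not sampled goes to the test set (an assumption).
    """
    rng = random.Random(seed)
    by_label = {}
    for item in annotations:
        by_label.setdefault(item["quality"], []).append(item)

    train = []
    for items in by_label.values():
        train.extend(rng.sample(items, per_class))

    train_ids = {id(item) for item in train}
    test = [item for item in annotations if id(item) not in train_ids]
    return train, test


# Toy data mirroring the 87% / 13% class ratio of the dataset.
annotations = (
    [{"tag": f"u{i}", "quality": "useful"} for i in range(87)]
    + [{"tag": f"n{i}", "quality": "not-useful"} for i in range(13)]
)
train, test = balanced_split(annotations, per_class=10)
print(len(train), len(test))
```

Sampling the same number of items per class keeps the classifier from simply learning the majority label during training.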

+
Machine learning
• Apply machine learning techniques
– Learning: learn the relation between features and goal from the training set
– Prediction: apply what was learned from the training set to the test set
• Used an SVM with the default PolyKernel in WEKA to predict the quality of annotations
– Commonly used, fast and resistant to over-fitting
16

+
Results
• The method predicts useful tags well but performs poorly on not-useful tags
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.90     0.90        0.90
F_man         Not useful   0.20     0.21        0.20
F_all         Useful       0.75     0.91        0.83
F_all         Not useful   0.42     0.18        0.25
F_ml          Useful       0.20     0.98        0.34
F_ml          Not useful   0.96     0.13        0.23
17
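For reference, a short sketch of how per-class recall, precision and F-measure are computed from actual and predicted labels. The labels below are toy data invented for illustration, not the Steve.museum results.

```python
def precision_recall_f(actual, predicted, positive):
    """Compute precision, recall and F-measure for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy labels: 8 useful and 2 not-useful annotations.
actual = ["useful"] * 8 + ["not-useful"] * 2
predicted = ["useful"] * 6 + ["not-useful"] * 2 + ["not-useful", "useful"]

for label in ("useful", "not-useful"):
    p, r, f = precision_recall_f(actual, predicted, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
```

Reporting the scores per class, as the table does, makes the weak performance on the minority class visible instead of hiding it in an overall accuracy figure.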

+
Identify relevant features of annotator
• Are these features helpful to
– determine annotation quality?
– predict annotator reputation?
[Example annotator: age 25, male, arts degree]
18

+
Building annotator reputation
• Based on a probabilistic logic called subjective logic
• Annotator opinion = (belief, disbelief, uncertainty)
– (p, n) = (positive, negative) evaluations
– Belief = p/(p+n+2), Uncertainty = 2/(p+n+2)
• The expectation value (E) is the reputation
– E = belief + apriori * uncertainty
– Apriori = 0.5
19
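The reputation formulas above can be worked through in a few lines. The function name is hypothetical; the disbelief term n/(p+n+2) is not on the slide and is added here from the standard subjective logic definition so that the three components sum to 1.

```python
def reputation(p, n, apriori=0.5):
    """Return (belief, disbelief, uncertainty, expectation) for an annotator
    with p positive and n negative evaluations."""
    total = p + n + 2
    belief = p / total
    disbelief = n / total      # standard definition, not shown on the slide
    uncertainty = 2 / total
    expectation = belief + apriori * uncertainty
    return belief, disbelief, uncertainty, expectation


# An annotator with 7 positive and 1 negative evaluation.
belief, disbelief, uncertainty, expectation = reputation(7, 1)
print(belief, disbelief, uncertainty, expectation)

# With no evaluations the reputation falls back to the a priori value.
print(reputation(0, 0)[3])
```

With few evaluations the uncertainty term dominates and the reputation stays close to 0.5; as evaluations accumulate, the reputation approaches the observed share of positive evaluations.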

+
Identify relevant annotator properties
• Manually identified properties
– F_man = [community, age, education, experience, gender, tagging experience…]
• List of all properties
– F_all = F_man + [vocabulary_size, vocab_diversity, is_anonymous, number of annotations in WordNet]
• Feature selection algorithm on F_all
– F_ml_a for annotation
– F_ml_u for annotator
20

+
Results
• Trained an SVM on these features to predict annotation quality
Feature set   Class        Recall   Precision   F-measure
F_man         Useful       0.29     0.90        0.44
F_man         Not useful   0.73     0.11        0.20
F_all         Useful       0.69     0.91        0.78
F_all         Not useful   0.43     0.15        0.22
F_ml_a        Useful       0.55     0.91        0.68
F_ml_a        Not useful   0.53     0.13        0.21
21

+
Results
• Used regression to predict reputation values based on features of registered annotators
• Since annotator reputation is highly skewed (90% of annotators have a reputation above 0.7), we could not predict reputation successfully
Feature set   Corr    RMS error   Mean abs. error   Rel. abs. error
F_man         -0.02   0.15        0.10              97.8%
F_all         0.22    0.13        0.09              95.1%
F_ml_u        0.29    0.13        0.09              90.4%
22
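A sketch of how the four reported measures can be computed from actual and predicted reputation values. The values below are invented, and the relative absolute error is assumed to be measured against a baseline that always predicts the mean of the actual values; the slides do not spell out the definition.

```python
import math


def regression_report(actual, predicted):
    """Correlation, RMS error, mean absolute error and relative absolute error."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p) if var_a > 0 and var_p > 0 else 0.0

    abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    rmse = math.sqrt(sum(e ** 2 for e in abs_errors) / n)
    mae = sum(abs_errors) / n

    # Assumed baseline: always predict the mean of the actual values.
    baseline = sum(abs(a - mean_a) for a in actual)
    rel_abs = sum(abs_errors) / baseline if baseline > 0 else float("nan")
    return {"corr": corr, "rmse": rmse, "mae": mae, "rel_abs": rel_abs}


actual = [0.6, 0.8, 1.0]      # hypothetical reputations
predicted = [0.7, 0.8, 0.9]   # hypothetical model output
report = regression_report(actual, predicted)
print(report)
```

Under this definition a relative absolute error near 100% means the model does little better than always predicting the mean, which is consistent with the conclusion that reputation could not be predicted successfully.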

+
Evaluation
• Possible reasons why the method was not successful at predicting not-useful annotations:
– They are a minority (13% of the whole dataset)
– A more in-depth analysis of features is needed to identify not-useful annotations
– Requires study on different datasets
23

+
Relevance
• Our experiments indicate a correlation between the features of annotator and annotation and the quality of annotations
• With a small set of features (F_ml), 98% of the annotations predicted as useful and 13% of those predicted as not-useful were predicted correctly (precision 0.98 and 0.13)
• Helps to identify which features are relevant to certain tasks
24

+
Conclusions
• Machine learning techniques help to predict useful evaluations, but not not-useful ones
• Devised a model
– using SVM to predict annotation evaluation and annotator reputation
– using regression to predict annotator reputation
25

+
Future work
• Need to extract more in-depth information from both annotation and annotator
• Need to build the reputation of the annotator per topic
• Apply the model to different use cases
26

+ Thank you
a.nottamkandath@vu.nl
27
